Compare commits

367 Commits

Author SHA1 Message Date
David Myriel
9e278fc5a6 fix small details 2025-05-05 23:03:17 +02:00
David Myriel
09fed1f286 add quickstart doc 2025-05-05 22:02:11 +02:00
Will Jones
cee2b5ea42 chore: upgrade pyarrow pin (#2192)
Closes #2191


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **Chores**
- Updated the required version of the pyarrow package to version 16 or
higher.
- Adjusted automated testing workflows to install pyarrow version 16 for
compatibility checks.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
2025-05-05 11:23:13 -07:00
Alex Pilon
f315f9665a feat: implement bindings to return merge stats (#2367)
Based on this comment:
https://github.com/lancedb/lancedb/issues/2228#issuecomment-2730463075
and https://github.com/lancedb/lance/pull/2357

Here is my attempt at implementing bindings for returning merge stats
from a `merge_insert.execute` call for lancedb.

Note: I have almost no idea what I am doing in Rust but tried to follow
existing code patterns and pay attention to compiler hints.
- The change in the nodejs binding appeared to be necessary to get
compilation to work; presumably this could actually work properly by
returning some kind of NAPI JS object of the stats data?
- I am unsure what to do with the remote/table.rs changes, which were
necessary for compilation to work. I assume this is related to LanceDB
Cloud, but I am unsure of the best way to handle that at this point.

Proof of function:

```python
import pandas as pd
import lancedb


db = lancedb.connect("/tmp/test.db")

test_data = pd.DataFrame(
    {
        "title": ["Hello", "Test Document", "Example", "Data Sample", "Last One"],
        "id": [1, 2, 3, 4, 5],
        "content": [
            "World",
            "This is a test",
            "Another example",
            "More test data",
            "Final entry",
        ],
    }
)

table = db.create_table("documents", data=test_data, exist_ok=True, mode="overwrite")

update_data = pd.DataFrame(
    {
        "title": [
            "Hello, World",
            "Test Document, it's good",
            "Example",
            "Data Sample",
            "Last One",
            "New One",
        ],
        "id": [1, 2, 3, 4, 5, 6],
        "content": [
            "World",
            "This is a test",
            "Another example",
            "More test data",
            "Final entry",
            "New content",
        ],
    }
)

stats = (
    table.merge_insert(on="id")
    .when_matched_update_all()
    .when_not_matched_insert_all()
    .execute(update_data)
)

print(stats)
```

returns

```
{'num_inserted_rows': 1, 'num_updated_rows': 5, 'num_deleted_rows': 0}
```
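
To make the counts above easier to read, here is a minimal pure-Python sketch of the same upsert semantics. It does not call LanceDB; `merge_insert_stats` is a hypothetical helper written only for illustration. It assumes, as the output above suggests, that every incoming row whose key already exists counts as updated (even if its values are unchanged) and every other row counts as inserted.

```python
def merge_insert_stats(existing, incoming, on):
    """Upsert `incoming` rows into `existing` (lists of dicts), keyed by `on`.

    Hypothetical helper: models when_matched_update_all() plus
    when_not_matched_insert_all(), and returns the resulting counts.
    """
    # Map each existing key to its position so matches can be updated in place.
    index = {row[on]: i for i, row in enumerate(existing)}
    stats = {"num_inserted_rows": 0, "num_updated_rows": 0, "num_deleted_rows": 0}
    for row in incoming:
        key = row[on]
        if key in index:
            # Matched: replace the whole row, whether or not it changed.
            existing[index[key]] = dict(row)
            stats["num_updated_rows"] += 1
        else:
            # Not matched: append as a new row.
            index[key] = len(existing)
            existing.append(dict(row))
            stats["num_inserted_rows"] += 1
    return stats


existing = [{"id": i, "title": f"t{i}"} for i in range(1, 6)]
incoming = [{"id": i, "title": f"new{i}"} for i in range(1, 7)]
stats = merge_insert_stats(existing, incoming, on="id")
print(stats)
```

With five existing ids and six incoming ids, five rows match and one is new, which mirrors the stats printed by the real call.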

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
- Merge-insert operations now return detailed statistics, including
counts of inserted, updated, and deleted rows.
- **Bug Fixes**
- Tests updated to validate returned merge-insert statistics for
accuracy.
- **Documentation**
- Method documentation improved to reflect new return values and clarify
merge operation results.
- Added documentation for the new `MergeStats` interface detailing
operation statistics.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Will Jones <willjones127@gmail.com>
2025-05-01 10:00:20 -07:00
Andrew C. Oliver
5deb26bc8b fix: prevent embedded objects from returning null in all of their fields (#2355)
For an embedded object such as metadata{filename=xyz}, the filename field
would be there structurally, but its value was ALWAYS null.

I didn't include this as a file but it may be useful for understanding
the problem for people searching on this issue so I'm including it here
as documentation. Before this patch any field that is more than 1 deep
is accepted but returns null values for subfields when queried.

```js
const lancedb = require('@lancedb/lancedb');

// Debug logger
function debug(message, data) {
  console.log(`[TEST] ${message}`, data !== undefined ? data : '');
}

// Log when our unwrapArrowObject is called
const kParent = Symbol.for("parent");
const kRowIndex = Symbol.for("rowIndex");

// Override console.log for our test
const originalConsoleLog = console.log;
console.log = function() {
  // Filter out noisy LanceDB info logs; pass everything else through once
  if (typeof arguments[0] === 'string' && arguments[0].includes('[INFO] [LanceDB]')) {
    return;
  }
  originalConsoleLog.apply(console, arguments);
};

async function main() {
  debug('Starting test...');
  
  // Connect to the database
  debug('Connecting to database...');
  const db = await lancedb.connect('./.lancedb');
  
  // Try to open an existing table, or create a new one if it doesn't exist
  let table;
  try {
    table = await db.openTable('test_nested_fields');
    debug('Opened existing table');
  } catch (e) {
    debug('Creating new table...');
    
    // Create test data with nested metadata structure
    const data = [
      {
        id: 'test1',
        vector: [1, 2, 3],
        metadata: {
          filePath: "/path/to/file1.ts",
          startLine: 10,
          endLine: 20,
          text: "function test() { return true; }"
        }
      },
      {
        id: 'test2',
        vector: [4, 5, 6],
        metadata: {
          filePath: "/path/to/file2.ts",
          startLine: 30,
          endLine: 40,
          text: "function test2() { return false; }"
        }
      }
    ];
    
    debug('Data to be inserted:', JSON.stringify(data, null, 2));
    
    // Create the table
    table = await db.createTable('test_nested_fields', data);
    debug('Table created successfully');
  }
  
  // Query the table and get results
  debug('Querying table...');
  const results = await table.search([1, 2, 3]).limit(10).toArray();
  
  // Log the results
  debug('Number of results:', results.length);
  
  if (results.length > 0) {
    const firstResult = results[0];
    debug('First result properties:', Object.keys(firstResult));
    
    // Check if metadata is accessible and what properties it has
    if (firstResult.metadata) {
      debug('Metadata properties:', Object.keys(firstResult.metadata));
      debug('Metadata filePath:', firstResult.metadata.filePath);
      debug('Metadata startLine:', firstResult.metadata.startLine);
      
      // Destructure to see if that helps
      const { filePath, startLine, endLine, text } = firstResult.metadata;
      debug('Destructured values:', { filePath, startLine, endLine, text });
      
      // Check if it's a proxy object
      debug('Result is proxy?', Object.getPrototypeOf(firstResult) === Object.prototype ? false : true);
      debug('Metadata is proxy?', Object.getPrototypeOf(firstResult.metadata) === Object.prototype ? false : true);
    } else {
      debug('Metadata is not accessible!');
    }
  }
  
  // Close the database
  await db.close();
}

main().catch(e => {
  console.error('Error:', e);
}); 
```

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **Bug Fixes**
- Improved handling of nested struct fields to ensure accurate
preservation of values during serialization and deserialization.
- Enhanced robustness when accessing nested object properties, reducing
errors with missing or null values.

- **Tests**
- Added tests to verify correct handling of nested struct fields through
serialization and deserialization.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Will Jones <willjones127@gmail.com>
2025-05-01 09:38:55 -07:00
Lance Release
3cc670ac38 Updating package-lock.json 2025-04-29 23:21:19 +00:00
Lance Release
4ade3e31e2 Updating package-lock.json 2025-04-29 22:19:46 +00:00
Lance Release
a222d2cd91 Updating package-lock.json 2025-04-29 22:19:30 +00:00
Lance Release
508e621f3d Bump version: 0.19.1-beta.0 → 0.19.1-beta.1 2025-04-29 22:19:14 +00:00
Lance Release
a1a0472f3f Bump version: 0.22.1-beta.0 → 0.22.1-beta.1 2025-04-29 22:18:53 +00:00
Wyatt Alt
3425a6d339 feat: upgrade lance to v0.27.0-beta.2 (#2364)
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **Chores**
- Updated dependencies for related components to use the latest version
from a specific repository source. No changes to features or public
functionality.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
2025-04-29 14:59:56 -07:00
Ryan Green
af54e0ce06 feat: add table stats API (#2363)
* Add a new "table stats" API to expose basic table and fragment
statistics with local and remote table implementations

### Questions
* This is using `calculate_data_stats` to determine total bytes in the
table. This seems like a potentially expensive operation - are there any
concerns about performance for large datasets?

### Notes
* bytes_on_disk seems to be stored at the column level but there does
not seem to be a way to easily calculate total bytes per fragment. This
may need to be added in lance before we can support fragment size
(bytes) statistics.
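
As a rough illustration of the fragment statistics discussed here, the sketch below summarizes a list of per-fragment row counts. It is an assumption-laden model, not the API added by this PR: the function name, the returned keys, and the nearest-rank percentile method are all hypothetical.

```python
import math


def fragment_size_stats(fragment_row_counts):
    """Summarize fragment sizes (in rows). Names and keys are hypothetical."""
    sizes = sorted(fragment_row_counts)
    if not sizes:
        # An empty table has no fragments; report zeros rather than failing.
        return {"num_fragments": 0, "min": 0, "max": 0, "mean": 0, "p50": 0, "p99": 0}

    def percentile(p):
        # Nearest-rank percentile: the smallest value covering p% of fragments.
        rank = max(1, math.ceil(p / 100 * len(sizes)))
        return sizes[rank - 1]

    return {
        "num_fragments": len(sizes),
        "min": sizes[0],
        "max": sizes[-1],
        "mean": sum(sizes) // len(sizes),
        "p50": percentile(50),
        "p99": percentile(99),
    }


populated = fragment_size_stats([100, 200, 300, 400])
empty = fragment_size_stats([])
```

Row counts are used here because, per the note above, total bytes per fragment are not yet easy to obtain.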

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **New Features**
- Added a method to retrieve comprehensive table statistics, including
total rows, index counts, storage size, and detailed fragment size
metrics such as minimum, maximum, mean, and percentiles.
- Enabled fetching of table statistics from remote sources through
asynchronous requests.
- Extended table interfaces across Python, Rust, and Node.js to support
synchronous and asynchronous retrieval of table statistics.
- **Tests**
- Introduced tests to verify the accuracy of the new table statistics
feature for both populated and empty tables.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
2025-04-29 15:19:08 -02:30
Lance Release
089905fe8f Updating package-lock.json 2025-04-28 19:13:36 +00:00
Lance Release
554939e5d2 Updating package-lock.json 2025-04-28 17:20:58 +00:00
Lance Release
7a13814922 Updating package-lock.json 2025-04-28 17:20:42 +00:00
Lance Release
e9f25f6a12 Bump version: 0.19.0 → 0.19.1-beta.0 2025-04-28 17:20:26 +00:00
Lance Release
419a433244 Bump version: 0.22.0 → 0.22.1-beta.0 2025-04-28 17:20:10 +00:00
LuQQiu
a9311c4dc0 feat: add list/create/delete/update/checkout tag API (#2353)
Add the tag-related APIs to list existing tags, attach a tag to a version,
update a tag's version, delete a tag, get the version of a tag, and
check out the version that a tag is bound to.
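
The operations listed above can be pictured with a small in-memory model. This is only a sketch of the semantics, not the LanceDB implementation: `TagRegistry` and its method names are hypothetical, and rejecting a duplicate tag on create is an assumption.

```python
class TagRegistry:
    """In-memory model of tag -> version bindings (illustrative only)."""

    def __init__(self):
        self._tags = {}

    def list_tags(self):
        # Return a copy so callers cannot mutate the registry directly.
        return dict(self._tags)

    def create(self, tag, version):
        # Assumption: creating a tag that already exists is an error.
        if tag in self._tags:
            raise ValueError(f"tag {tag!r} already exists")
        self._tags[tag] = version

    def update(self, tag, version):
        if tag not in self._tags:
            raise KeyError(tag)
        self._tags[tag] = version

    def delete(self, tag):
        del self._tags[tag]

    def get_version(self, tag):
        return self._tags[tag]

    def resolve(self, ref):
        """Return the version to check out for a version number or a tag name."""
        return ref if isinstance(ref, int) else self._tags[ref]


tags = TagRegistry()
tags.create("v1-prod", 3)
tags.update("v1-prod", 5)
checked_out = tags.resolve("v1-prod")
```

Resolving either an integer or a tag name to a version is what lets checkout accept both forms.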

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
- Introduced table version tagging, allowing users to create, update,
delete, and list human-readable tags for specific table versions.
  - Enabled checking out a table by either version number or tag name.
- Added new interfaces for tag management in both Python and Node.js
APIs, supporting synchronous and asynchronous workflows.

- **Bug Fixes**
  - None.

- **Documentation**
- Updated documentation to describe the new tagging features, including
usage examples.

- **Tests**
- Added comprehensive tests for tag creation, updating, deletion,
listing, and version checkout by tag in both Python and Node.js
environments.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
2025-04-28 10:04:46 -07:00
LuQQiu
178bcf9c90 fix: hybrid search explain plan analyze plan (#2360)
Fix the explain plan and analyze plan APIs for hybrid search.

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **New Features**
- Added options to view the execution plan and analyze the runtime
performance of hybrid queries.
- **Refactor**
- Improved internal handling of query setup for better modularity and
maintainability.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
2025-04-27 18:39:43 -07:00
Lance Release
b9be092cb1 Updating package-lock.json 2025-04-25 22:05:57 +00:00
Lance Release
e8c0c52315 Updating package-lock.json 2025-04-25 21:17:03 +00:00
Lance Release
a60fa0d3b7 Updating package-lock.json 2025-04-25 21:16:48 +00:00
Lance Release
726d629b9b Bump version: 0.19.0-beta.12 → 0.19.0 2025-04-25 21:16:30 +00:00
Lance Release
b493f56dee Bump version: 0.19.0-beta.11 → 0.19.0-beta.12 2025-04-25 21:16:25 +00:00
Lance Release
a8b5ad7e74 Bump version: 0.22.0-beta.12 → 0.22.0 2025-04-25 21:16:07 +00:00
Lance Release
f8f6264883 Bump version: 0.22.0-beta.11 → 0.22.0-beta.12 2025-04-25 21:16:07 +00:00
Will Jones
d8517117f1 feat: upgrade Lance to v0.26.0 (#2359)
Upstream changelog:
https://github.com/lancedb/lance/releases/tag/v0.26.0

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **Chores**
- Updated dependency management to use published crate versions for
improved reliability and maintainability.
- Added a temporary workaround for build issues by pinning a specific
version of a dependency.
- **Refactor**
- Improved resource management and concurrency by updating internal
ownership models for object storage components.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
2025-04-25 13:59:12 -07:00
Lance Release
ab66dd5ed2 Updating package-lock.json 2025-04-25 06:04:06 +00:00
Lance Release
cbb9a7877c Updating package-lock.json 2025-04-25 05:02:47 +00:00
Lance Release
b7fc223535 Updating package-lock.json 2025-04-25 05:02:32 +00:00
Lance Release
1fdaf7a1a4 Bump version: 0.19.0-beta.10 → 0.19.0-beta.11 2025-04-25 05:02:16 +00:00
Lance Release
d11819c90c Bump version: 0.22.0-beta.10 → 0.22.0-beta.11 2025-04-25 05:01:57 +00:00
BubbleCal
9b902272f1 fix: sync hybrid search ignores the distance range params (#2356)
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
- Added support for distance range filtering in hybrid vector queries,
allowing users to specify lower and upper bounds for search results.

- **Tests**
- Introduced new tests to validate distance range filtering and
reranking in both synchronous and asynchronous hybrid query scenarios.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
2025-04-25 13:01:22 +08:00
Will Jones
8c0622fa2c fix: remote limit to avoid "Limit must be non-negative" (#2354)
To work around this issue: https://github.com/lancedb/lancedb/issues/2211

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **Bug Fixes**
- Improved handling of large query parameters to prevent potential
overflow issues when using the "k" parameter in queries.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
2025-04-24 15:04:06 -07:00
Philip Meier
2191f948c3 fix: add missing pydantic model config compat (#2316)
Fixes #2315.

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **Refactor**
- Enhanced query processing to maintain smooth functionality across
different dependency versions, ensuring improved stability and
performance.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
2025-04-22 14:46:10 -07:00
Will Jones
acc3b03004 ci: fix docs deploy (#2351)
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **Chores**
- Improved CI workflow for documentation builds by optimizing Rust build
settings and updating the runner environment.
  - Fixed a typo in a workflow step name.
- Streamlined caching steps to reduce redundancy and improve efficiency.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
2025-04-22 13:55:34 -07:00
Lance Release
7f091b8c8e Updating package-lock.json 2025-04-22 19:16:43 +00:00
Lance Release
c19bdd9a24 Updating package-lock.json 2025-04-22 18:24:16 +00:00
Lance Release
dad0ff5cd2 Updating package-lock.json 2025-04-22 18:23:59 +00:00
Lance Release
a705621067 Bump version: 0.19.0-beta.9 → 0.19.0-beta.10 2025-04-22 18:23:39 +00:00
Lance Release
39614fdb7d Bump version: 0.22.0-beta.9 → 0.22.0-beta.10 2025-04-22 18:23:17 +00:00
Ryan Green
96d534d4bc feat: add retries to remote client for requests with stream bodies (#2349)
Closes https://github.com/lancedb/lancedb/issues/2307
* Adds retries to remote operations with stream bodies (add,
merge_insert)
* Change default retryable status codes to 409, 429, 500, 502, 503, 504
* Don't retry add or merge_insert operations on 5xx responses

Notes:
* Supporting retries on stream bodies means we have to buffer the body
into memory so it can be cloned on retry. This will impact memory use
patterns for the remote client. This buffering can be disabled by
disabling retries (i.e. setting retries to 0 in RetryConfig)
* It does not seem that retry config can be specified by env vars as the
documentation suggests. I added a follow-up issue
[here](https://github.com/lancedb/lancedb/issues/2350)
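
The buffering trade-off described in the notes can be sketched as follows. This is a simplified model, not the client's actual code: `send_with_retries`, its parameters, and the `send` callable are hypothetical, and the delay formula is just one way to do exponential backoff with jitter.

```python
import io
import random
import time

# Default retryable status codes listed in this PR.
RETRYABLE_STATUSES = {409, 429, 500, 502, 503, 504}


def send_with_retries(send, body_stream, retries=3, retry_5xx=True, base_delay=0.0):
    """Send a streamed body, replaying it from a buffer on retryable statuses.

    `send` is a hypothetical callable taking a file-like body and returning
    an HTTP status code. Pass retry_5xx=False for add/merge_insert.
    """
    if retries == 0:
        # Retries disabled: pass the stream through without buffering it.
        return send(body_stream)

    payload = body_stream.read()  # buffer in memory so the body can be replayed
    attempt = 0
    while True:
        status = send(io.BytesIO(payload))
        retryable = status in RETRYABLE_STATUSES and (retry_5xx or status < 500)
        if not retryable or attempt >= retries:
            return status
        # Exponential backoff with jitter (zero delay by default in this sketch).
        time.sleep(base_delay * (2 ** attempt) * random.random())
        attempt += 1


responses = iter([409, 200])
seen = []


def fake_send(stream):
    seen.append(stream.read())
    return next(responses)


status = send_with_retries(fake_send, io.BytesIO(b"rows"), retry_5xx=False)
```

The second attempt sees the same bytes as the first, which is only possible because the stream was read into memory up front.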



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
- Enhanced retry support for remote requests with configurable limits
and exponential backoff with jitter.
- Added robust retry logic for streaming data uploads, enabling retries
with buffered data to ensure reliability.

- **Bug Fixes**
- Improved error handling and retry behavior for HTTP status codes 409
and 504.

- **Refactor**
- Centralized and modularized HTTP request sending and retry logic
across remote database and table operations.
  - Streamlined request ID management for improved traceability.
- Simplified error message construction in index waiting functionality.

- **Tests**
  - Added a test verifying merge-insert retries on HTTP 409 responses.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
2025-04-22 15:40:44 -02:30
Lance Release
5051d30d09 Updating package-lock.json 2025-04-21 23:55:43 +00:00
Lance Release
db853c4041 Updating package-lock.json 2025-04-21 22:50:56 +00:00
Lance Release
76d1d22bdc Updating package-lock.json 2025-04-21 22:50:40 +00:00
Lance Release
d8746c61c6 Bump version: 0.19.0-beta.8 → 0.19.0-beta.9 2025-04-21 22:50:20 +00:00
Lance Release
1a66df2627 Bump version: 0.22.0-beta.8 → 0.22.0-beta.9 2025-04-21 22:49:59 +00:00
Will Jones
44670076c1 fix: move timeout to avoid retries (#2347)
I added a timeout to query execution options in
https://github.com/lancedb/lancedb/pull/2288. However, it was applied as
the request timeout, and the retry implementation is unaware of this
timeout, so once the query timed out, a retry would be triggered.
This PR changes it so the timeout happens outside the retry
loop.
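
A minimal asyncio sketch of the idea, assuming a retry loop that treats connection errors as retryable. The function names are hypothetical and this is not the client's real code; it only shows a deadline wrapped around the whole loop rather than around each request.

```python
import asyncio


async def run_with_retries(op, retries=3):
    """Retry `op` on ConnectionError; knows nothing about the overall deadline."""
    last_error = None
    for _ in range(retries + 1):
        try:
            return await op()
        except ConnectionError as err:
            last_error = err
    raise last_error


async def execute_query(op, timeout, retries=3):
    # The deadline wraps the entire retry loop, so hitting it cancels the
    # loop instead of looking like one more retryable failure.
    return await asyncio.wait_for(run_with_retries(op, retries), timeout)


calls = {"n": 0}


async def slow_query():
    calls["n"] += 1
    await asyncio.sleep(10)


try:
    asyncio.run(execute_query(slow_query, timeout=0.05))
    timed_out = False
except asyncio.TimeoutError:
    timed_out = True
```

The slow query is attempted once and then cancelled by the deadline; no retry is started.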

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **Bug Fixes**
- Improved query timeout handling to provide clearer error messages and
more reliable cancellation if a query takes too long to complete.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
2025-04-21 14:27:04 -07:00
Will Jones
92f0b16e46 fix(python): make sure pandas is optional (#2346)
Fixes #2344


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **Tests**
- Updated tests to use PyArrow Tables instead of pandas DataFrames where
possible, reducing reliance on pandas.
- Tests that require pandas are now automatically skipped if pandas is
not installed.
- **Chores**
- Improved workflow to uninstall both pylance and pandas in a specific
test step.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
2025-04-21 13:42:13 -07:00
Eileen Noonan
1620ba3508 docs: make table.update() nodejs guide consistent with API documentation (#2334)
The docs in the Guide here do not match the
[API reference](https://lancedb.github.io/lancedb/js/classes/Table/#updateopts)
for the nodejs client.

I am writing an Elixir wrapper over the typescript library (Rust
forthcoming!) and confirmed in testing that the API reference is correct
vs the Guide.

Following the Guide docs, the error I got was:

"lance error: Invalid user input: Schema error: No field named bar.
Valid fields are foo."

for a query of:

await table.update({foo: "buzz"}, { where: "foo = 'bar'"});

over a table with a schema of just {foo: Utf8}.

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **Documentation**
- Reformatted a code snippet in the guide to enhance readability by
splitting it into multiple lines for improved clarity.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
2025-04-21 08:38:16 -07:00
Ryan Green
3ae90dde80 feat: add new table API to wait for async indexing (#2338)
* Add new wait_for_index() table operation that polls until indices are
created/fully indexed
* Add an optional wait timeout parameter to all create_index operations
* Python and NodeJS interfaces
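
The polling behavior can be sketched like this. It is a stand-alone model under stated assumptions: `get_unindexed_rows` is a hypothetical callable returning the number of rows not yet indexed (or `None` if the index does not exist yet), and the real `wait_for_index()` signature may differ.

```python
import time


def wait_for_index(get_unindexed_rows, index_names, timeout=5.0, poll_interval=0.01):
    """Poll until every named index reports zero unindexed rows, or time out."""
    deadline = time.monotonic() + timeout
    while True:
        # A missing index (None) is treated as "not ready yet".
        remaining = {name: get_unindexed_rows(name) for name in index_names}
        if all(count == 0 for count in remaining.values()):
            return
        if time.monotonic() >= deadline:
            pending = [name for name, count in remaining.items() if count != 0]
            raise TimeoutError(f"timed out waiting for indices: {pending}")
        time.sleep(poll_interval)


progress = iter([100, 40, 0])
polls = []


def fake_unindexed_rows(name):
    value = next(progress)
    polls.append(value)
    return value


wait_for_index(fake_unindexed_rows, ["vector_idx"])
```

In this sketch the deadline is checked after each poll, so at least one status check always happens even with a very small timeout.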

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
- Added optional waiting for index creation completion with configurable
timeout.
- Introduced methods to poll and wait for indices to be fully built
across sync and async tables.
  - Extended index creation APIs to accept a wait timeout parameter.
- **Bug Fixes**
- Added a new timeout error variant for improved error reporting on
index operations.
- **Tests**
- Added tests covering successful index readiness waiting, timeout
scenarios, and missing index cases.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
2025-04-21 08:41:21 -02:30
Magnus
4f07fea6df feat: add ColPali embedding support with MultiVector type (#2170)
This PR adds ColPali support with a ColPaliEmbeddings class (tagged
"colpali") using ColQwen2.5 for multi-vector text/image embeddings. It also
adds a MultiVector Pydantic type to handle the vector lists.

I've added some integration tests for the embedding model and some unit
tests for the new Pydantic type. This could serve as a template for other
ColPali variants as well, or until transformers🤗 starts supporting it.


Still `TODO`:

- [ ] Documentation
- [ ] Add an example

_Could also allow Image as query, but didn't work well when testing it._

[ColPali-Engine](https://github.com/illuin-tech/colpali) version:
0.3.9.dev17+g3faee24

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
- Introduced support for ColPali-based multimodal multi-vector
embeddings for both text and images.
- Added a new embedding class for generating multi-vector embeddings,
configurable for various model and processing options.
- Added a new Pydantic type for multi-vector embeddings, supporting
validation and schema generation for lists of fixed-dimension vectors.

- **Bug Fixes**
- Ensured proper asynchronous index creation in query tests for improved
reliability.

- **Tests**
- Added integration tests for ColPali embeddings, including
text-to-image search and validation of multi-vector fields.
- Added comprehensive tests for the new multi-vector Pydantic type,
covering schema, validation, and default value behavior.

- **Chores**
  - Updated optional dependencies to include the ColPali engine.
  - Added utility to check for availability of flash attention support.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
2025-04-21 11:47:37 +08:00
Lance Release
3d7d82cf86 Updating package-lock.json 2025-04-17 23:13:37 +00:00
Lance Release
edc4e40a7b Updating package-lock.json 2025-04-17 22:16:36 +00:00
Lance Release
ca3806a02f Updating package-lock.json 2025-04-17 22:16:20 +00:00
Lance Release
35cff12e31 Bump version: 0.19.0-beta.7 → 0.19.0-beta.8 2025-04-17 22:16:02 +00:00
Lance Release
c6c20cb2bd Bump version: 0.22.0-beta.7 → 0.22.0-beta.8 2025-04-17 22:15:46 +00:00
Weston Pace
26080ee4c1 feat: add prewarm_index function (#2342)
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
- Added the ability to prewarm (load into memory) table indexes via new
methods in Python, Node.js, and Rust APIs, potentially reducing
cold-start query latency.
- **Bug Fixes**
- Ensured prewarming an index does not interfere with subsequent search
operations.
- **Tests**
- Introduced new test cases to verify full-text search index creation,
prewarming, and search functionalities in both Python and Node.js.
- **Chores**
  - Updated dependencies for improved compatibility and performance.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Lu Qiu <luqiujob@gmail.com>
2025-04-17 15:14:36 -07:00
Guspan Tanadi
ef3a2b5357 docs: intended path relative links (#2321)
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **Documentation**
- Updated the link in the documentation to correctly reference the
workflow file, ensuring accurate navigation from the current context.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Guspan Tanadi <36249910+guspan-tanadi@users.noreply.github.com>
2025-04-16 13:12:09 -07:00
Adam Azzam
c42a201389 docs: remove trailing commas from AWS IAM Policies (#2324)
Before:

<img width="1173" alt="Screenshot 2025-04-08 at 10 58 50 AM"
src="https://github.com/user-attachments/assets/e5c69c45-ab68-488f-9c7f-e12f7ecbfaab"
/>

After:
<img width="1136" alt="Screenshot 2025-04-08 at 10 58 58 AM"
src="https://github.com/user-attachments/assets/108c11ea-09b3-49b5-9a50-b880e72a0270"
/>


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **Documentation**
- Updated JSON policy examples in the storage guides to correct
formatting issues and enhance syntax clarity for readers.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
2025-04-16 13:09:21 -07:00
Lance Release
24e42ccd4d Updating package-lock.json 2025-04-15 05:29:37 +00:00
Lance Release
8a50944061 Updating package-lock.json 2025-04-15 04:11:16 +00:00
Lance Release
40e066bc7c Updating package-lock.json 2025-04-15 04:11:00 +00:00
Lance Release
b3ad105fa0 Bump version: 0.19.0-beta.6 → 0.19.0-beta.7 2025-04-15 04:10:43 +00:00
Lance Release
6e701d3e1b Bump version: 0.22.0-beta.6 → 0.22.0-beta.7 2025-04-15 04:10:26 +00:00
BubbleCal
2248aa9508 fix: bugs for new FTS APIs (#2314)
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
- Enhanced full-text search capabilities with support for phrase
queries, fuzzy matching, boosting, and multi-column matching.
- Search methods now accept full-text query objects directly, improving
query flexibility and precision.
- Python and JavaScript SDKs updated to handle full-text queries
seamlessly, including async search support.

- **Tests**
- Added comprehensive tests covering fuzzy search, phrase search, and
boosted queries to ensure robust full-text search functionality.

- **Documentation**
- Updated query class documentation to reflect new constructor options
and removal of deprecated methods for clarity and simplicity.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
2025-04-15 11:51:35 +08:00
PhorstenkampFuzzy
a6fa69ab89 fix(python): add pylance as its own optional dependency (#2336)
This change allows the pylance dependency to be managed centrally, without
everybody needing to monitor for compatibility manually.

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **New Features**
- Introduced an optional dependency that enhances development support.
Users can now benefit from improved static analysis capabilities when
installing the recommended version (0.23.2 or later).

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
2025-04-14 09:28:16 -07:00
Will Jones
b3a4efd587 fix: revert change default read_consistency_interval=5s (#2327)
This reverts commit a547c523c2 or #2281

The current implementation can cause panics and performance degradation.
I will bring this back with more testing in
https://github.com/lancedb/lancedb/pull/2311

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **Documentation**
- Enhanced clarity on read consistency settings with updated
descriptions and default behavior.
- Removed outdated warnings about eventual consistency from the
troubleshooting guide.

- **Refactor**
- Streamlined the handling of the read consistency interval across
integrations, now defaulting to "None" for improved performance.
  - Simplified internal logic to offer a more consistent experience.

- **Tests**
- Updated test expectations to reflect the new default representation
for the read consistency interval.
- Removed redundant tests related to "no consistency" settings for
streamlined testing.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
2025-04-14 08:48:15 -07:00
Lei Xu
4708b60bb1 chore: cargo update on main (#2331)
Fix test failures on main
2025-04-12 09:00:47 -05:00
Lei Xu
080ea2f9a4 chore: fix 1.86 warnings (#2312)
Fix rust 1.86 warnings
2025-04-12 08:29:10 -05:00
Ayush Chaurasia
32fdde23f8 fix: robust handling of empty result when reranking (#2313)
While running experiments I found an edge case: depending on the
underlying reranking library, some rerankers don't handle empty lists
well. This PR manually checks whether the result set to be reranked is empty.
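
A minimal sketch of that kind of guard, assuming a reranker is a callable taking a query and a list of results. `rerank_safely` and `fragile_reranker` are hypothetical names; the fragile one stands in for a library that fails on empty input.

```python
def rerank_safely(rerank, query, results):
    """Skip the reranker entirely when there is nothing to rank."""
    if len(results) == 0:
        return results
    return rerank(query, results)


def fragile_reranker(query, results):
    # max() raises ValueError on an empty sequence, like some real rerankers
    # that assume at least one candidate.
    longest = max(len(r) for r in results)
    # Toy scoring: longer results rank first.
    return sorted(results, key=lambda r: longest - len(r))


empty = rerank_safely(fragile_reranker, "q", [])
ranked = rerank_safely(fragile_reranker, "q", ["a", "abc", "ab"])
```

Putting the check in one wrapper keeps the behavior consistent regardless of which reranking library sits underneath.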

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **Bug Fixes**
- Enhanced search result processing by ensuring that reordering only
occurs when valid, non-empty results are available, thereby preventing
unnecessary operations and potential errors.

- **Tests**
- Added automated tests to verify that empty search result sets are
handled correctly, ensuring consistent behavior across various
rerankers.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
2025-04-09 16:26:05 +05:30
Lance Release
c44e5c046c Updating package-lock.json 2025-04-08 07:01:33 +00:00
Lance Release
f23aa0a793 Updating package-lock.json 2025-04-08 06:17:03 +00:00
Lance Release
83fc2b1851 Updating package-lock.json 2025-04-08 06:16:48 +00:00
Lance Release
56aa133ee6 Bump version: 0.19.0-beta.5 → 0.19.0-beta.6 2025-04-08 06:16:30 +00:00
Lance Release
27d9e5c596 Bump version: 0.22.0-beta.5 → 0.22.0-beta.6 2025-04-08 06:16:14 +00:00
BubbleCal
ec8271931f feat: support to create FTS index on list of strings (#2317)
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **Chores**
- Updated internal library dependencies to the latest beta version for
improved system stability.
- **Tests**
- Added automated tests to validate full-text search functionality on
list-based text fields.
- **Refactor**
- Enhanced the search processing logic to provide robust support for
list-type text data, ensuring more reliable results.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
2025-04-08 14:12:35 +08:00
Lance Release
6c6966600c Updating package-lock.json 2025-04-04 22:56:57 +00:00
Lance Release
2e170c3c7b Updating package-lock.json 2025-04-04 21:50:28 +00:00
Lance Release
fd92e651d1 Updating package-lock.json 2025-04-04 21:50:12 +00:00
Lance Release
c298482ee1 Bump version: 0.19.0-beta.4 → 0.19.0-beta.5 2025-04-04 21:49:53 +00:00
Lance Release
d59f64b5a3 Bump version: 0.22.0-beta.4 → 0.22.0-beta.5 2025-04-04 21:49:34 +00:00
fzowl
30ed8c4c43 fix: voyageai regression multimodal supercedes text models (#2268)
fix #2160
2025-04-04 14:45:56 -07:00
Will Jones
4a2cdbf299 ci: provide token for deprecate call (#2309)
This should prevent the failures we are seeing in Node release.

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **Chore**
- Enhanced the package deprecation process with improved security
measures, ensuring smoother and more reliable updates during package
deprecation.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
2025-04-04 14:44:58 -07:00
Will Jones
657843d9e9 perf: remove redundant checkout latest (#2310)
This bug was introduced in https://github.com/lancedb/lancedb/pull/2281

Likely introduced during a rebase when fixing merge conflicts.

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **Refactor**
- Updated the refresh process so that reloading now uses the existing
dataset version instead of automatically updating to the latest version.
This change may affect workflows that rely on immediate data updates
during refresh.
  
- **New Features**
- Introduced a new module for tracking I/O statistics in object store
operations, enhancing monitoring capabilities.
- Added a new test module to validate the functionality of the dataset
operations.

- **Bug Fixes**
- Reintroduced the `write_options` method in the `CreateTableBuilder`,
ensuring consistent functionality across different builder variants.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
2025-04-04 12:56:02 -07:00
Will Jones
1cd76b8498 feat: add timeout to query execution options (#2288)
Closes #2287


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
- Added configurable timeout support for query executions. Users can now
specify maximum wait times for queries, enhancing control over
long-running operations across various integrations.
- **Tests**
- Expanded test coverage to validate timeout behavior in both
synchronous and asynchronous query flows, ensuring timely error
responses when query execution exceeds the specified limit.
- Introduced a new test suite to verify query operations when a timeout
is reached, checking for appropriate error handling.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
2025-04-04 12:34:41 -07:00
Lei Xu
a38f784081 chore: add numpy as dependency (#2308) 2025-04-04 10:33:39 -07:00
Will Jones
647dee4e94 ci: check release builds when we change dependencies (#2299)
The issue we fixed in https://github.com/lancedb/lancedb/pull/2296 was
caused by an upgrade in dependencies. This could have been caught if we
had run these CI jobs when we did the dependency change.

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **Chores**
- Updated our automated pipeline to trigger additional stability checks
when dependency configurations change, ensuring smoother build and
release processes.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
2025-04-03 16:19:00 -07:00
Lance Release
0844c2dd64 Updating package-lock.json 2025-04-02 21:23:50 +00:00
Lance Release
fd2692295c Updating package-lock.json 2025-04-02 21:23:34 +00:00
Lance Release
d4ea50fba1 Bump version: 0.19.0-beta.3 → 0.19.0-beta.4 2025-04-02 21:23:19 +00:00
Lance Release
0d42297cf8 Bump version: 0.22.0-beta.3 → 0.22.0-beta.4 2025-04-02 21:23:02 +00:00
Weston Pace
a6d4125cbf feat: upgrade lance to 0.25.3b2 (#2304)
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **Chores**
	- Updated core dependency versions to v0.25.3-beta.2.
	- Enabled additional functionality with a new "dynamodb" feature.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
2025-04-02 14:22:30 -07:00
Lance Release
5c32a99e61 Updating package-lock.json 2025-04-02 09:28:46 +00:00
Lance Release
cefaa75b24 Updating package-lock.json 2025-04-02 09:28:30 +00:00
Lance Release
bd62c2384f Bump version: 0.19.0-beta.2 → 0.19.0-beta.3 2025-04-02 09:28:14 +00:00
Lance Release
f0bc08c0d7 Bump version: 0.22.0-beta.2 → 0.22.0-beta.3 2025-04-02 09:27:55 +00:00
BubbleCal
e52ac79c69 fix: can't do structured FTS in python (#2300)
Structured FTS was not supported in the `search()` API, and there were
some pydantic errors.

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
- Enhanced full-text search capabilities by incorporating additional
parameters, enabling more flexible query definitions.
- Extended table search functionality to support full-text queries
alongside existing search types.

- **Tests**
- Introduced new tests that validate both structured and conditional
full-text search behaviors.
- Expanded test coverage for various query types, including MatchQuery,
BoostQuery, MultiMatchQuery, and PhraseQuery.

- **Bug Fixes**
- Fixed a logic issue in query processing to ensure correct handling of
full-text search queries.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
2025-04-02 17:27:15 +08:00
Will Jones
f091f57594 ci: fix lancedb musl builds (#2296)
Fixes #2255


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **Chores**
- Enhanced the build process to improve performance and reliability
across Linux platforms.
  - Updated environment settings for more accurate compiler integration.
- Activated previously inactive build configurations to support advanced
feature support.
- Added support for the x86_64 architecture on Linux systems utilizing
the musl C library.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
2025-04-01 14:44:27 -07:00
Lance Release
a997fd4108 Updating package-lock.json 2025-04-01 17:28:57 +00:00
Lance Release
1486514ccc Updating package-lock.json 2025-04-01 17:28:40 +00:00
Lance Release
a505bc3965 Bump version: 0.19.0-beta.1 → 0.19.0-beta.2 2025-04-01 17:28:21 +00:00
Lance Release
c1738250a3 Bump version: 0.22.0-beta.1 → 0.22.0-beta.2 2025-04-01 17:27:57 +00:00
Weston Pace
1ee63984f5 feat: allow FSB to be used for btree indices (#2297)
We recently allowed this for lance but there was a check in lancedb as
well that was preventing it

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **New Features**
- Added support for indexing fixed-size binary data using B-tree
structures for efficient data storage and retrieval.
- **Tests**
- Implemented automated tests to ensure the new binary indexing works
correctly and meets the expected configuration.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
2025-04-01 10:27:22 -07:00
Lance Release
2eb2c8862a Updating package-lock.json 2025-04-01 14:27:26 +00:00
Lance Release
4ea8e178d3 Updating package-lock.json 2025-04-01 14:27:07 +00:00
Lance Release
e4485a630e Bump version: 0.19.0-beta.0 → 0.19.0-beta.1 2025-04-01 14:26:47 +00:00
Lance Release
fb95f9b3bd Bump version: 0.22.0-beta.0 → 0.22.0-beta.1 2025-04-01 14:26:28 +00:00
Weston Pace
625bab3f21 feat: update to lance 0.25.3b1 (#2294)
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **Chores**
- Updated dependency versions for improved performance and
compatibility.

- **New Features**
- Added support for structured full-text search with expanded query
types (e.g., match, phrase, boost, multi-match) and flexible input
formats.
- Introduced a new method to check server support for structural
full-text search features.
- Enhanced the query system with new classes and interfaces for handling
various full-text queries.
- Expanded the functionality of existing methods to accept more complex
query structures, including updates to method signatures.

- **Bug Fixes**
  - Improved error handling and reporting for full-text search queries.

- **Refactor**
- Enhanced query processing with streamlined input handling and improved
error reporting, ensuring more robust and consistent search results
across platforms.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Co-authored-by: BubbleCal <bubble-cal@outlook.com>
2025-04-01 06:36:42 -07:00
Will Jones
e59f9382a0 ci: deprecate vectordb each release (#2292)
I realized that each time we published, the new package was no longer
deprecated. This re-deprecates the package after each new publish.
2025-03-31 12:03:04 -07:00
Lance Release
fdee7ba477 Updating package-lock.json 2025-03-30 19:09:17 +00:00
Lance Release
c44fa3abc4 Updating package-lock.json 2025-03-30 18:05:07 +00:00
Lance Release
fc43aac0ed Updating package-lock.json 2025-03-30 18:04:51 +00:00
Lance Release
e67cd0baf9 Bump version: 0.18.3-beta.0 → 0.19.0-beta.0 2025-03-30 18:04:32 +00:00
Lance Release
26dab93f2a Bump version: 0.21.3-beta.0 → 0.22.0-beta.0 2025-03-30 18:04:14 +00:00
LuQQiu
b9bdb8d937 fix: fix remote restore api to always checkout latest version (#2291)
Fix restore to always check out the latest version, following the local
restore API implementation:

a1d1833a40/rust/lancedb/src/table.rs (L1910)
Otherwise:
table.create_table -> version 1
table.add_table -> version 2
table.checkout(1), table.restore() -> the version remains at 1 (restore
should call checkout_latest internally to move to the latest version and
allow write operations)
table.checkout_latest() -> version is 3
Only then can write operations be done.
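
The version sequence above can be sketched with a toy model. `ToyTable` and
all of its methods are hypothetical stand-ins written for illustration; they
are not the LanceDB remote table API.

```python
class ToyTable:
    """Toy stand-in for a versioned table; not the LanceDB API."""

    def __init__(self):
        self.versions = {1: []}   # version number -> snapshot of rows
        self.checked_out = None   # None means "tracking the latest version"

    @property
    def latest(self):
        return max(self.versions)

    @property
    def version(self):
        return self.latest if self.checked_out is None else self.checked_out

    def add(self, rows):
        if self.checked_out is not None:
            raise RuntimeError("table is pinned to an old version; writes are not allowed")
        self.versions[self.latest + 1] = self.versions[self.latest] + list(rows)

    def checkout(self, version):
        self.checked_out = version

    def checkout_latest(self):
        self.checked_out = None

    def restore(self):
        # Commit the pinned snapshot as a new latest version ...
        self.versions[self.latest + 1] = list(self.versions[self.version])
        # ... then follow the latest version again so writes are allowed.
        self.checkout_latest()


table = ToyTable()      # create -> version 1
table.add(["a"])        # add -> version 2
table.checkout(1)
table.restore()         # version 3 holds the contents of version 1
table.add(["b"])        # succeeds because restore() checked out the latest
```

Without the `checkout_latest()` call inside `restore()`, the final `add` in
this model would raise, which mirrors the behavior described above.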
2025-03-29 22:46:57 -07:00
LuQQiu
a1d1833a40 feat: add analyze_plan api (#2280)
Add an analyze plan API that executes the query and reports runtime
metrics, which helps identify query IO overhead and the causes of query
slowness.
2025-03-28 14:28:52 -07:00
Will Jones
a547c523c2 feat!: change default read_consistency_interval=5s (#2281)
Previously, when we loaded the next version of the table, we would block
all reads with a write lock. Now, we only do that if
`read_consistency_interval=0`. Otherwise, we load the next version
asynchronously in the background. This should mean that
`read_consistency_interval > 0` won't have a meaningful impact on
latency.

Along with this change, I felt it was safe to change the default
consistency interval to 5 seconds. The current default is `None`, which
means we will **never** check for a new version by default. I think that
default is contrary to most users' expectations.
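
As a rough illustration of the rules described above, here is a small decision
function. It is only a model of the prose in this message: the function name
and the returned labels are made up for this sketch and do not exist in
lancedb.

```python
from datetime import timedelta
from typing import Optional


def version_check_mode(
    interval: Optional[timedelta], since_last_check: timedelta
) -> str:
    """Return how a read treats version checks in this toy model."""
    if interval is None:
        return "never"        # old default: never look for a newer version
    if interval == timedelta(0):
        return "blocking"     # check before every read, holding the write lock
    if since_last_check >= interval:
        return "background"   # stale: load the next version without blocking reads
    return "skip"             # still fresh: keep using the current version


new_default = timedelta(seconds=5)
mode = version_check_mode(new_default, since_last_check=timedelta(seconds=7))
```

In this model only an interval of zero ever blocks a read, which is why a
positive interval should not have a meaningful impact on latency.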
2025-03-28 11:04:31 -07:00
Lance Release
dc8b75feab Updating package-lock.json 2025-03-28 17:15:17 +00:00
Lance Release
c1600cdc06 Updating package-lock.json 2025-03-28 16:04:01 +00:00
Lance Release
f5dee46970 Updating package-lock.json 2025-03-28 16:03:46 +00:00
Lance Release
346cbf8bf7 Bump version: 0.18.2-beta.0 → 0.18.3-beta.0 2025-03-28 16:03:31 +00:00
Lance Release
3c7dfe9f28 Bump version: 0.21.2-beta.0 → 0.21.3-beta.0 2025-03-28 16:03:17 +00:00
Lei Xu
f52d05d3fa feat: add columns using pyarrow schema (#2284) 2025-03-28 08:51:50 -07:00
vinoyang
c321cccc12 chore(java): make rust release to be a switch option (#2277) 2025-03-28 11:26:24 +08:00
LuQQiu
cba14a5743 feat: add restore remote api (#2282) 2025-03-27 16:33:52 -07:00
vinoyang
72057b743d chore(java): introduce spotless plugin (#2278) 2025-03-27 10:38:39 +08:00
LuQQiu
698f329598 feat: add explain plan remote api (#2263)
Add explain plan remote api
2025-03-26 11:22:40 -07:00
BubbleCal
79fa745130 feat: upgrade lance to v0.25.1-beta.3 (#2276)
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
2025-03-26 23:14:27 +08:00
vinoyang
2ad71bdeca fix(java): make test work for jdk8 (#2269) 2025-03-25 10:57:49 -07:00
vinoyang
7c13615096 fix(java): add .gitignore file (#2270) 2025-03-25 10:56:08 -07:00
Wyatt Alt
f882f5b69a fix: update Query pydoc (#2273)
Removes reference of nonexistent method.
2025-03-25 08:50:23 -07:00
Benjamin Clavié
a68311a893 fix: answerdotai rerankers argument passing (#2117)
This fixes an issue for people wishing to use different kinds of
rerankers in lancedb via AnswerDotAI rerankers. Currently, the arguments
are passed sequentially, but they don't match the [Reranker class
implementation](d604a8c47d/rerankers/reranker.py (L179)):
the second argument is expected to be an optional "lang" for default
models, while model_type should be passed explicitly.

The one-line change in this PR fixes it and enables the use of other
methods (e.g. LLMs-as-rerankers).
2025-03-24 12:31:59 +05:30
Ayush Chaurasia
846a5cea33 fix: handle light and dark mode logo (#2265) 2025-03-22 10:21:05 -07:00
QianZhu
e3dec647b5 docs: replace banner as an image (#2262) 2025-03-21 18:35:35 -07:00
QianZhu
c58104cecc docs: add banner for LanceDB Cloud in public beta (#2261) 2025-03-21 17:54:34 -07:00
QianZhu
b3b5362632 docs: replace Lancedb Cloud link (#2259)
* direct users to cloud.lancedb.com since LanceDB Cloud is in public
beta
* removed the `cast vector dimension` from alter columns as we don't
support it
2025-03-21 17:43:00 -07:00
Will Jones
abe06fee3d feat(python): warn on fork (#2258)
Closes #768
2025-03-21 17:18:10 -07:00
Will Jones
93a82fd371 ci: allow dry run on PR to Python release (#2245)
This just makes it easier to test in the future.
2025-03-21 16:14:32 -07:00
Will Jones
0d379e6ffa ci(node): setup URL so auth token is picked up (#2257)
Should fix failure seen here:
https://github.com/lancedb/lancedb/actions/runs/13999958170/job/39207039825
2025-03-21 16:14:24 -07:00
Lance Release
e1388bdfdd Updating package-lock.json 2025-03-21 20:46:53 +00:00
Lance Release
315a24c2bc Updating package-lock.json 2025-03-21 20:03:43 +00:00
Lance Release
6dd4cf6038 Updating package-lock.json 2025-03-21 20:03:27 +00:00
Lance Release
f97e751b3c Bump version: 0.18.1 → 0.18.2-beta.0 2025-03-21 20:02:59 +00:00
Lance Release
e803a626a1 Bump version: 0.21.1 → 0.21.2-beta.0 2025-03-21 20:02:25 +00:00
Weston Pace
9403254442 feat: add to_query_object method (#2239)
This PR adds a `to_query_object` method to the various query builders
(except not hybrid queries yet). This makes it possible to inspect the
query that is built.

In addition this PR does some normalization between the sync and async
query paths. A few custom defaults were removed in favor of None (with
the default getting set once, in rust).

Also, the synchronous to_batches method will now actually stream results

Also, the remote API now defaults to prefiltering
2025-03-21 13:01:51 -07:00
Will Jones
b2a38ac366 fix: make pylance optional again (#2209)
The two remaining blockers were:

* A method `with_embeddings` that was deprecated a year ago
* A typecheck for `LanceDataset`
2025-03-21 11:26:32 -07:00
BubbleCal
bdb6c09c3b feat: support binary vector and IVF_FLAT in TypeScript (#2221)
resolve #2218

---------

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
2025-03-21 10:57:08 -07:00
Will Jones
2bfdef2624 ci: refactor node releases (#2223)
This PR fixes build issues associated with `aws-lc-rs`, while
simplifying the build process. Previously, we used custom scripts for
the musl and Windows ARM builds. These were complicated and prone to
breaking. This PR switches to a setup that mirrors
https://github.com/napi-rs/package-template/blob/main/.github/workflows/CI.yml.

* linux glibc and musl builds now use the Docker images provided by the
napi project
* Windows ARM build now just cross compiles from Windows x64, which
turns out to work quite well.
2025-03-21 10:56:29 -07:00
Samuel Colvin
7982d5c082 fix: correct rust install docs (#2253)
I'm pretty sure you mean `cargo add lancedb` here, `cargo install
lancedb` fails right now.
2025-03-21 10:12:53 -07:00
BubbleCal
7ff6ec7fe3 feat: upgrade to lance v0.25.0-beta.5 (#2248)
- adds `loss` into the index stats for vector index
- now `optimize` can retrain the vector index

---------

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
2025-03-21 10:12:23 -07:00
Ayush Chaurasia
ba1ded933a fix: add better check for empty results in hybrid search (#2252)
fixes: https://github.com/lancedb/lancedb/issues/2249
2025-03-21 13:05:05 +05:30
Will Jones
b595d8a579 fix(nodejs): workaround for apache-arrow null vector issue (#2244)
Fixes #2240
2025-03-20 08:07:10 -07:00
Will Jones
2a1d6d8abf ci: simplify windows builds (#2243)
We soon won't rely on cross compiling from Linux to windows, so can
remove this check. Instead, check that we can cross compile from Windows
between architectures.
2025-03-20 08:06:56 -07:00
Will Jones
440a466a13 ci: remove OpenSSL as dependency in favor of rustls (#2242)
`object_store` already hard codes `rustls` as the TLS implementation, so
we have been shipping a mix of `rustls` and `openssl`. For simplicity of
builds, we should consolidate to one, and that has to be `rustls`.
2025-03-20 08:06:45 -07:00
Ayush Chaurasia
b9afd9c860 docs: add late interaction, multi-vector guide & link example (#2231)
1/2 docs update for this week. Addresses issues from this docs epic -
https://github.com/lancedb/lancedb/issues/1476
2025-03-20 20:29:32 +05:30
Will Jones
a6b6f6a806 ci: drop vectordb support for musl, windows ARM (#2241)
vectordb is deprecated, and these platforms are particularly difficult
to maintain. Removing now to prevent further headaches.

We will keep these platforms supported on `@lancedb/lancedb`.
2025-03-19 12:23:46 -07:00
Ayush Chaurasia
ae1548b507 docs: add cloud & enterprise cta (#2235)
2/2 docs update this week
- Add cloud & enterprise CTA
- remove outdated projects/examples from landing page
2025-03-19 10:55:05 -07:00
Weston Pace
4e03ee82bc refactor: rework catalog/database options (#2213)
The `ConnectRequest` has a set of properties that only make sense for
listing databases / catalogs and a set of properties that only make
sense for remote databases.

This PR reduces all options to a single `HashMap<String, String>`. This
makes it easier to add new database / catalog implementations and makes
it clearer to users which options are applicable in which situations.

I don't believe there are any breaking changes here. The closest thing
is that I placed the `ConnectBuilder` methods `api_key`, `region`, and
`host_override` behind a `remote` feature gate. This is not strictly
needed and I could remove the feature gate but it seemed appropriate.
Since using these methods without the remote feature would have been
meaningless I don't feel this counts as a breaking change.

We could look at removing these methods entirely from the
`ConnectBuilder` (and encouraging users to use `RemoteDatabaseOptions`
instead) but I'm not sure how I feel about that.

Another approach we could take is to move these methods into a
`RemoteConnectBuilderExt` trait (and there could be a similar
`ListingConnectBuilderExt` trait to add methods for the listing database
/ catalog).

For now though my main goal is to simplify `ConnectRequest` as much as
possible (I see this being part of the key public API for database /
catalog integrations, similar to the `BaseTable`, `Catalog`, and
`Database` traits and I'd like it to be simple).
2025-03-18 10:13:59 -07:00
Weston Pace
46a6846d07 refactor: remove dataset reference from base table (#2226) 2025-03-17 06:27:33 -07:00
Will Jones
a207213358 fix: insert structs in non-alphabetical order (#2222)
Closes #2114

Starting in #1965, we no longer pass the table schema into
`pa.Table.from_pylist()`. This means PyArrow is choosing the order of
the struct subfields, and apparently it does them in alphabetical order.
This is fine in theory, since in Lance we support providing fields in
any order. However, before we pass it to Lance, we call
`pa.Table.cast()` to align column types to the table types.
`pa.Table.cast()` is strict about field order, so we need to create a
cast target schema that aligns with the input data. We were doing this
at the top-level fields, but weren't doing this in nested fields. This
PR adds support to do this for nested ones.
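
The idea can be sketched without PyArrow by treating a schema as an ordered
dict of field name to type, where a struct type is itself a dict. The helper
`align_field_order` and the example schemas are hypothetical and only
illustrate the recursion; the real change operates on PyArrow schemas.

```python
def align_field_order(target, data):
    """Reorder `target` fields (recursively) to follow the order found in `data`.

    Both arguments map field name -> type, and a struct type is a nested dict.
    Types always come from `target`; only the ordering comes from `data`.
    A field present in `data` but missing from `target` raises KeyError.
    """
    aligned = {}
    for name, data_type in data.items():
        target_type = target[name]
        if isinstance(target_type, dict) and isinstance(data_type, dict):
            aligned[name] = align_field_order(target_type, data_type)
        else:
            aligned[name] = target_type
    return aligned


table_schema = {"id": "int64", "location": {"lon": "float32", "lat": "float32"}}
# Inferred from the input: subfields arrive alphabetically, with wider types.
data_schema = {"id": "int64", "location": {"lat": "float64", "lon": "float64"}}

cast_target = align_field_order(table_schema, data_schema)
```

The resulting cast target keeps the table's `float32` types but lists the
struct subfields in the input's order, so an order-sensitive cast can succeed.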
2025-03-13 14:46:05 -07:00
BubbleCal
6c321c694a feat: upgrade lance to 0.25.0-beta2 (#2220)
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
2025-03-13 14:12:54 -07:00
Bob Liu
5c00b2904c feat: add get dataset method on NativeTable (#2021)
I want to make the dataset method on the native table public so that I
can use more Lance methods, such as order_by, which are not exposed in
the lancedb crate.
2025-03-13 11:15:28 -07:00
Gagan Bhullar
14677d7c18 fix: metric type inconsistency (#2122)
PR fixes #2113

---------

Co-authored-by: Will Jones <willjones127@gmail.com>
2025-03-12 10:28:37 -07:00
Martin Schorfmann
dd22a379b2 fix: use Self return type annotation for abstract query builder (#2127)
Hello LanceDB team,

while developing using `lancedb` as a library I encountered a typing
problem affecting IDE hints and completions during development.

---

## Current Situation

Currently, the abstract base class `lancedb.query:LanceQueryBuilder`
uses method chaining to build up the search parameters, where the
methods have `LanceQueryBuilder` as a return type hint.

This leads to two issues:
1. Implementing subclasses of `LanceQueryBuilder` need to override
methods to modify the return type hint, even when they don't need to
change its implementation, just to ensure adequate IDE hints and
completions.
2. When using method chaining the first method directly inherited from
the abstract `LanceQueryBuilder` causes the inferred type to switch back
to `LanceQueryBuilder`. So even when the type starts from
`lancedb.table:LanceTable.search(query_type="vector", ...)` and therefore
correctly is inferred as `LanceVectorQueryBuilder`, after calling e.g.
`LanceVectorQueryBuilder.limit(...)` it is seen as the abstract
`LanceQueryBuilder` from that point on.

### Example of current situation


![image](https://github.com/user-attachments/assets/09678727-8722-43bd-a8a2-67d9b5fc0db5)

## Proposed changes

I propose to change the return type hints of the corresponding methods
(including classmethod `create()`) in the abstract base class
`LanceQueryBuilder` from `LanceQueryBuilder` to `Self`.
`Self` is already imported in the module:

```py
    if sys.version_info >= (3, 11):
        from typing import Self
    else:
        from typing_extensions import Self
```
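
A minimal sketch of the effect, using hypothetical `QueryBuilder` and
`VectorQueryBuilder` classes rather than the real lancedb ones. To keep the
sketch runnable without extra packages, `Self` is imported only for type
checkers here, which differs from the import shown above.

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Only needed by type checkers; annotations are not evaluated at runtime
    # because of the __future__ import above.
    from typing_extensions import Self


class QueryBuilder:
    """Hypothetical base class, standing in for an abstract builder."""

    def __init__(self) -> None:
        self._limit = None

    def limit(self, n: int) -> Self:
        self._limit = n
        return self


class VectorQueryBuilder(QueryBuilder):
    """Hypothetical subclass that adds one method and overrides nothing."""

    def nprobes(self, n: int) -> Self:
        self._nprobes = n
        return self


# A type checker infers VectorQueryBuilder after .limit(), so .nprobes() resolves.
query = VectorQueryBuilder().limit(5).nprobes(20)
```

If `limit` were annotated as returning `QueryBuilder`, a type checker would
flag the `.nprobes(20)` call even though it works at runtime.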

### Further possible changes

Additionally, the implementing subclasses could also change the return
type hints to `Self` to potentially allow for further inheritance
easily.
> [!NOTE]
> **However this is not part of this pull request as of writing.**

### Example after proposed changes


![image](https://github.com/user-attachments/assets/a9aea636-e426-477a-86ee-2dad3af2876f)

---

Best regards
Martin
2025-03-12 10:08:25 -07:00
Will Jones
7747c9bcbf feat(node): parse arrow types in alterColumns() (#2208)
Previously, users could only specify new data types in `alterColumns` as
strings:

```ts
await tbl.alterColumns([
  {
    path: "price",
    dataType: "float",
  },
]);
```

But this has some problems:

1. It wasn't clear which types were valid
2. It was impossible to specify nested types, like lists and vector
columns.

This PR changes it to take an Arrow data type, similar to how the Python
API works. This allows casting vector types:

```ts
await tbl.alterColumns([
  {
    path: "vector",
    dataType: new arrow.FixedSizeList(
      2,
      new arrow.Field("item", new arrow.Float16(), false),
    ),
  },
]);
```

Closes #2185
2025-03-12 09:57:36 -07:00
QianZhu
c9d6fc43a6 docs: use bypass_vector_index() instead of use_index=false (#2115) 2025-03-12 09:31:09 -07:00
Martin Schorfmann
581bcfbb88 docs: fix docstring of EmbeddingFunction (#2118)
Hello LanceDB team,

---

I have fixed a discrepancy in the class docstring of
`lancedb.embeddings.base:EmbeddingFunction` and made consistency
alignments to that docstring.

### Changes made

1. The docstring referred to the abstract method
`get_source_embeddings()`.
  This method does not exist in the repository at the current state.
I have changed the mention to refer to the actual abstract method
`compute_source_embeddings()`.
2. Also, I aligned the consistency within the ordered list which is
describing the methods to be implemented by concrete embedding
functions.

---

Thank you for developing this useful library. 👍

Best regards
Martin
2025-03-12 09:30:01 -07:00
vinoyang
3750639b5f feat(rust): add connect_catalog method to support connect catalog via url (#2177) 2025-03-12 05:19:03 -07:00
Lance Release
e744d54460 Updating package-lock.json 2025-03-11 14:00:55 +00:00
Lance Release
9d1ce4b5a5 Updating package-lock.json 2025-03-11 13:15:18 +00:00
Lance Release
729ce5e542 Updating package-lock.json 2025-03-11 13:15:03 +00:00
Lance Release
de6739e7ec Bump version: 0.18.1-beta.0 → 0.18.1 2025-03-11 13:14:49 +00:00
Lance Release
495216efdb Bump version: 0.18.0 → 0.18.1-beta.0 2025-03-11 13:14:44 +00:00
Lance Release
a3b45a4d00 Bump version: 0.21.1-beta.0 → 0.21.1 2025-03-11 13:14:30 +00:00
Lance Release
c316c2f532 Bump version: 0.21.0 → 0.21.1-beta.0 2025-03-11 13:14:29 +00:00
Weston Pace
3966b16b63 fix: restore pylance as mandatory dependency (#2204)
We attempted to make pylance optional in
https://github.com/lancedb/lancedb/pull/2156 but it appears this did not
quite work. Users are unable to use lancedb from a fresh install. This
reverts the optional-ness so we can get back in a working state while we
fix the issue.
2025-03-11 06:13:52 -07:00
Lance Release
5661cc15ac Updating package-lock.json 2025-03-10 23:53:56 +00:00
Lance Release
4e7220400f Updating package-lock.json 2025-03-10 23:13:52 +00:00
Lance Release
ae4928fe77 Updating package-lock.json 2025-03-10 23:13:36 +00:00
Lance Release
e80a405dee Bump version: 0.18.0-beta.1 → 0.18.0 2025-03-10 23:13:18 +00:00
Lance Release
a53e19e386 Bump version: 0.18.0-beta.0 → 0.18.0-beta.1 2025-03-10 23:13:13 +00:00
Lance Release
c0097c5f0a Bump version: 0.21.0-beta.2 → 0.21.0 2025-03-10 23:12:56 +00:00
Lance Release
c199708e64 Bump version: 0.21.0-beta.1 → 0.21.0-beta.2 2025-03-10 23:12:56 +00:00
Weston Pace
4a47150ae7 feat: upgrade to lance 0.24.1 (#2199) 2025-03-10 15:18:37 -07:00
Wyatt Alt
f86b20a564 fix: delete tables from DDB on drop_all_tables (#2194)
Prior to this commit, issuing drop_all_tables on a listing database with
an external manifest store would delete physical tables but leave
references behind in the manifest store. The table drop would succeed,
but subsequent creation of a table with the same name would fail with a
conflict.

With this patch, the external manifest store is updated to account for
the dropped tables so that dropped table names can be reused.
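
The failure mode can be reproduced with a toy model in which storage and the
manifest store are two separate sets. `ToyListingDatabase` and its
`update_manifest` flag are hypothetical and exist only to contrast the old and
new behavior; they are not part of lancedb.

```python
class ToyListingDatabase:
    """Toy model of a listing database with an external manifest store."""

    def __init__(self):
        self.storage = set()    # physical table directories
        self.manifest = set()   # names recorded in the external manifest store

    def create_table(self, name):
        if name in self.manifest:
            raise ValueError(f"table {name!r} already exists")
        self.storage.add(name)
        self.manifest.add(name)

    def drop_all_tables(self, update_manifest=True):
        self.storage.clear()
        if update_manifest:
            self.manifest.clear()


db = ToyListingDatabase()
db.create_table("docs")
db.drop_all_tables(update_manifest=False)   # behavior before this patch
try:
    db.create_table("docs")
    conflict = False
except ValueError:
    conflict = True                         # stale manifest entry blocks reuse

db.drop_all_tables()                        # behavior with this patch
db.create_table("docs")                     # the name can be reused
```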
2025-03-10 15:00:53 -07:00
msu-reevo
cc81f3e1a5 fix(python): typing (#2167)
@wjones127 is there a standard way you guys set up your virtualenv? I can
either relist all the dependencies in the pyright precommit section, or
specify a venv, or the user has to be in the virtual environment when
they run git commit. If the venv location was standardized or a python
manager like `uv` was used it would be easier to avoid duplicating the
pyright dependency list.

Per your suggestion, in `pyproject.toml` I added in all the passing
files to the `includes` section.

For ruff I upgraded the version and removed "TCH" which doesn't exist as
an option.

I added a `pyright_report.csv` which contains a list of all files sorted
by pyright errors ascending as a todo list to work on.

I fixed about 30 issues in `table.py` stemming from str's being passed
into methods that required a string within a set of string Literals by
extracting them into `types.py`

Can you verify in the rust bridge that the schema should be a property
and not a method here? If it's a method, then there's another place in
the code where `inner.schema` should be `inner.schema()`
``` python
class RecordBatchStream:
    @property
    def schema(self) -> pa.Schema: ...
```

Also unless the `_lancedb.pyi` file is wrong, then there is no
`__anext__` here for `__inner` when it's not an `AsyncGenerator` and
only `next` is defined:
``` python
    async def __anext__(self) -> pa.RecordBatch:
        # `_inner` is either an async generator or a RecordBatchStream,
        # which only defines `next()`.
        if isinstance(self._inner, AsyncGenerator):
            batch = await self._inner.__anext__()
        else:
            batch = await self._inner.next()
        if batch is None:
            raise StopAsyncIteration
        return batch
```
in the else statement, `_inner` is a `RecordBatchStream`
```python
class RecordBatchStream:
    @property
    def schema(self) -> pa.Schema: ...
    async def next(self) -> Optional[pa.RecordBatch]: ...
```

---------

Co-authored-by: Will Jones <willjones127@gmail.com>
2025-03-10 09:01:23 -07:00
Weston Pace
bc49c4db82 feat: respect datafusion's batch size when running as a table provider (#2187)
Datafusion makes the batch size available as part of the `SessionState`.
We should use that to set the `max_batch_length` property in the
`QueryExecutionOptions`.
2025-03-07 05:53:36 -08:00
Weston Pace
d2eec46f17 feat: add support for streaming input to create_table (#2175)
This PR makes it possible to create a table using an asynchronous stream
of input data. Currently only a synchronous iterator is supported. There
are a number of follow-ups not yet tackled:

* Support for embedding functions (the embedding functions wrapper needs
to be re-written to be async, should be an easy lift)
* Support for async input into the remote table (the make_ipc_batch
needs to change to accept async input, leaving undone for now because I
think we want to support actual streaming uploads into the remote table
soon)
* Support for async input into the add function (pretty essential, but
it is a fairly distinct code path, so saving for a different PR)
2025-03-06 11:55:00 -08:00
Lance Release
51437bc228 Bump version: 0.21.0-beta.0 → 0.21.0-beta.1 2025-03-06 19:23:06 +00:00
Bert
fa53cfcfd2 feat: support modifying field metadata in lancedb python (#2178) 2025-03-04 16:58:46 -05:00
vinoyang
374fe0ad95 feat(rust): introduce Catalog trait and implement ListingCatalog (#2148)
Co-authored-by: Weston Pace <weston.pace@gmail.com>
2025-03-03 20:22:24 -08:00
BubbleCal
35e5b84ba9 chore: upgrade lance to 0.24.0-beta.1 (#2171)
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
2025-03-03 12:32:12 +08:00
Lei Xu
7c12d497b0 ci: bump python to 3.12 in GHA (#2169) 2025-03-01 17:24:02 -08:00
ayao227
dfe4ba8dad chore: add reo integration (#2149)
This PR adds reo integration to the lancedb documentation website.
2025-02-28 07:51:34 -08:00
Weston Pace
fa1b9ad5bd fix: don't use with_schema to remove schema metadata (#2162)
It seems that `RecordBatch::with_schema` is unable to remove schema
metadata from a batch. It fails with the error `target schema is not
superset of current schema`.

I'm not sure how the `test_metadata_erased` test is passing. Strangely,
the metadata was not present by the time the batch arrived at the
metadata eraser. I think maybe the schema metadata is only present in
the batch if there is a filter.

I've created a new unit test that makes sure the metadata is erased if
we have a filter also
2025-02-27 10:24:00 -08:00
BubbleCal
8877eb020d feat: record the server version for remote table (#2147)
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
2025-02-27 15:55:59 +08:00
Will Jones
01e4291d21 feat(python): drop hard dependency on pylance (#2156)
Closes #1793
2025-02-26 15:53:45 -08:00
Lance Release
ab3ea76ad1 Updating package-lock.json 2025-02-26 21:23:39 +00:00
Lance Release
728ef8657d Updating package-lock.json 2025-02-26 20:11:37 +00:00
Lance Release
0b13901a16 Updating package-lock.json 2025-02-26 20:11:22 +00:00
Lance Release
84b110e0ef Bump version: 0.17.0 → 0.18.0-beta.0 2025-02-26 20:11:07 +00:00
Lance Release
e1836e54e3 Bump version: 0.20.0 → 0.21.0-beta.0 2025-02-26 20:10:54 +00:00
Weston Pace
4ba5326880 feat: reapply upgrade lance to v0.23.3-beta.1 (#2157)
This reverts commit 2f0c5baea2.

---------

Co-authored-by: Lu Qiu <luqiujob@gmail.com>
2025-02-26 11:44:11 -08:00
Lance Release
b036a69300 Updating package-lock.json 2025-02-26 19:32:22 +00:00
Will Jones
5b12a47119 feat!: revert query limit to be unbounded for scans (#2151)
In earlier PRs (#1886, #1191) we made the default limit 10 regardless of
the query type. This was confusing for users and in many cases a
breaking change. Users would have queries that used to return all
results, but instead only returned the first 10, causing silent bugs.

Part of the cause was consistency: the Python sync API seems to have
always had a limit of 10, while newer APIs (Python async and Nodejs)
didn't.

This PR sets the default limit only for searches (vector search, FTS),
while letting scans (even with filters) be unbounded. It does this
consistently for all SDKs.
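
The rule described above can be sketched as a small decision function. This is a hypothetical model for illustration only, not LanceDB's implementation: the name `effective_limit`, the query-kind strings, and the constant name are assumptions.

```python
from typing import Optional

# Hypothetical constant: the default limit mentioned above for searches.
DEFAULT_SEARCH_LIMIT = 10


def effective_limit(query_kind: str, limit: Optional[int] = None) -> Optional[int]:
    """Return the row limit applied to a query, or None for unbounded."""
    if limit is not None:
        return limit  # an explicit limit always wins
    if query_kind in ("vector", "fts"):
        return DEFAULT_SEARCH_LIMIT  # searches keep a default limit
    return None  # plain scans, with or without filters, are unbounded


print(effective_limit("vector"))   # 10
print(effective_limit("scan"))     # None
print(effective_limit("scan", 5))  # 5
```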

Fixes #1983
Fixes #1852
Fixes #2141
2025-02-26 10:32:14 -08:00
Lance Release
769d483e50 Updating package-lock.json 2025-02-26 18:16:59 +00:00
Lance Release
9ecb11fe5a Updating package-lock.json 2025-02-26 18:16:42 +00:00
Lance Release
22bd8329f3 Bump version: 0.17.0-beta.0 → 0.17.0 2025-02-26 18:16:07 +00:00
Lance Release
a736fad149 Bump version: 0.16.1-beta.3 → 0.17.0-beta.0 2025-02-26 18:16:01 +00:00
Lance Release
072adc41aa Bump version: 0.20.0-beta.0 → 0.20.0 2025-02-26 18:15:23 +00:00
Lance Release
c6f25ef1f0 Bump version: 0.19.1-beta.3 → 0.20.0-beta.0 2025-02-26 18:15:23 +00:00
Weston Pace
2f0c5baea2 Revert "chore: upgrade lance to v0.23.3-beta.1 (#2153)"
This reverts commit a63dd66d41.
2025-02-26 10:14:29 -08:00
BubbleCal
a63dd66d41 chore: upgrade lance to v0.23.3-beta.1 (#2153)
this fixes a bug in SQ, see https://github.com/lancedb/lance/pull/3476
for more details

---------

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Co-authored-by: Lu Qiu <luqiujob@gmail.com>
2025-02-26 09:52:28 -08:00
Weston Pace
d6b3ccb37b feat: upgrade lance to 0.23.2 (#2152)
This also changes the pylance pin from `==0.23.2` to `~=0.23.2` which
should allow the pylance dependency to float a little. The pylance
dependency is actually not used for much anymore and so it should be
tolerant of patch changes.
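
As a rough illustration of the difference between the two pins, the sketch below compares `==` with the compatible-release operator `~=` from PEP 440, where `~=0.23.2` is equivalent to `>=0.23.2, ==0.23.*`. The helper names are made up for this example, and it only handles plain numeric versions (no pre-release suffixes).

```python
def parse(version: str) -> tuple:
    """Split a plain numeric version such as '0.23.2' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def matches_exact(candidate: str, pinned: str) -> bool:
    """`==X.Y.Z`: only the exact version is accepted."""
    return parse(candidate) == parse(pinned)


def matches_compatible(candidate: str, pinned: str) -> bool:
    """`~=X.Y.Z`: accept `>= X.Y.Z` as long as the `X.Y` prefix is unchanged."""
    c, p = parse(candidate), parse(pinned)
    return c >= p and c[: len(p) - 1] == p[:-1]


print(matches_exact("0.23.5", "0.23.2"))       # False
print(matches_compatible("0.23.5", "0.23.2"))  # True
print(matches_compatible("0.24.0", "0.23.2"))  # False
```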
2025-02-26 09:02:51 -08:00
Weston Pace
c4f99e82e5 feat: push filters down into DF table provider (#2128) 2025-02-25 14:46:28 -08:00
andrew-pienso
979a2d3d9d docs: fixes is_open docstring on AsyncTable (#2150) 2025-02-25 09:11:25 -08:00
Will Jones
7ac5f74c80 feat!: add variable store to embeddings registry (#2112)
BREAKING CHANGE: embedding function implementations in Node need to now
call `resolveVariables()` in their constructors and should **not**
implement `toJSON()`.

This tries to address the handling of secrets. In Node, they are
currently lost. In Python, they are currently leaked into the table
schema metadata.

This PR introduces an in-memory variable store on the function registry.
It also allows embedding function definitions to label certain config
values as "sensitive", and the preprocessing logic will raise an error
if users try to pass in hard-coded values.
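
The idea can be sketched with a toy variable store. Everything here is an assumption made for illustration: the class `VariableStore`, its methods, and the `$var:` placeholder syntax are hypothetical and are not taken from the LanceDB API.

```python
class VariableStore:
    """Toy in-memory store that resolves placeholders at runtime."""

    PREFIX = "$var:"  # hypothetical placeholder syntax

    def __init__(self):
        self._vars = {}

    def set_var(self, name: str, value: str) -> None:
        self._vars[name] = value

    def resolve(self, config: dict, sensitive_keys: set) -> dict:
        """Return a copy of config with placeholders replaced by stored values."""
        resolved = {}
        for key, value in config.items():
            is_ref = isinstance(value, str) and value.startswith(self.PREFIX)
            if key in sensitive_keys and not is_ref:
                # Hard-coded secrets are rejected so they never reach metadata.
                raise ValueError(f"'{key}' is sensitive; pass a variable reference")
            if is_ref:
                name = value[len(self.PREFIX):]
                if name not in self._vars:
                    raise KeyError(f"variable '{name}' is not set")
                resolved[key] = self._vars[name]
            else:
                resolved[key] = value
        return resolved


store = VariableStore()
store.set_var("api_key", "sk-test")
config = {"name": "my-model", "api_key": "$var:api_key"}
runtime_config = store.resolve(config, sensitive_keys={"api_key"})
print(runtime_config["api_key"])  # sk-test
```

In this sketch only `config` (with the placeholder) would be persisted, while `runtime_config` stays in memory.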

Closes #2110
Closes #521

---------

Co-authored-by: Weston Pace <weston.pace@gmail.com>
2025-02-24 15:52:19 -08:00
Will Jones
ecdee4d2b1 feat(python): add search() method to async API (#2049)
Reviving #1966.

Closes #1938

The `search()` method can apply embeddings for the user. This simplifies
hybrid search, so instead of writing:

```python
vector_query = embeddings.compute_query_embeddings("flower moon")[0]
await (
    async_tbl.query()
    .nearest_to(vector_query)
    .nearest_to_text("flower moon")
    .to_pandas()
)
```

You can write:

```python
await (await async_tbl.search("flower moon", query_type="hybrid")).to_pandas()
```

Unfortunately, we had to do a double-await here because `search()` needs
to be async. This is because it often needs to do IO to retrieve and run
an embedding function.
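
The double-await pattern can be reproduced with plain `asyncio`. The classes below are hypothetical stand-ins, not the LanceDB classes: `search()` is a coroutine that returns a builder, and the builder's terminal method is a coroutine as well.

```python
import asyncio


class ToyQueryBuilder:
    """Stand-in for an async query builder."""

    def __init__(self, rows):
        self._rows = rows

    async def to_list(self):
        await asyncio.sleep(0)  # stands in for query execution IO
        return list(self._rows)


class ToyAsyncTable:
    """Stand-in for an async table whose search() must do IO first."""

    async def search(self, text: str) -> ToyQueryBuilder:
        await asyncio.sleep(0)  # stands in for loading/running an embedding function
        return ToyQueryBuilder([f"match for {text!r}"])


async def main():
    table = ToyAsyncTable()
    # Inner await yields the builder; outer await runs the query.
    return await (await table.search("flower moon")).to_list()


result = asyncio.run(main())
print(result)
```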
2025-02-24 14:19:25 -08:00
BubbleCal
f391ed828a fix: remote table doesn't apply the prefilter flag for FTS (#2145) 2025-02-24 21:37:43 +08:00
BubbleCal
a99a450f2b fix: flat FTS panic with prefilter and update lance (#2144)
this is fixed in lance so upgrade lance to 0.23.2-beta1
2025-02-24 14:34:00 +08:00
Lei Xu
6fa1f37506 docs: improve pydantic integration docs (#2136)
Address usage mistakes in
https://github.com/lancedb/lancedb/issues/2135.

* Add example of how to use `LanceModel` and `Vector` decorator
* Add test for pydantic doc
* Fix the example to directly use LanceModel instead of calling
`MyModel.to_arrow_schema()` in the example.
* Add cross-reference link to pydantic doc site
* Configure mkdocs to watch code changes in python directory.
2025-02-21 12:48:37 -08:00
BubbleCal
544382df5e fix: handle batch queries in single request (#2139) 2025-02-21 13:23:39 +08:00
BubbleCal
784f00ef6d chore: update Cargo.lock (#2137) 2025-02-21 12:27:10 +08:00
Lance Release
96d7446f70 Updating package-lock.json 2025-02-20 04:51:26 +00:00
Lance Release
99ea78fb55 Updating package-lock.json 2025-02-20 03:38:44 +00:00
Lance Release
8eef4cdc28 Updating package-lock.json 2025-02-20 03:38:27 +00:00
Lance Release
0f102f02c3 Bump version: 0.16.1-beta.2 → 0.16.1-beta.3 2025-02-20 03:38:01 +00:00
Lance Release
a33a0670f6 Bump version: 0.19.1-beta.2 → 0.19.1-beta.3 2025-02-20 03:37:27 +00:00
BubbleCal
14c9ff46d1 feat: support multivector on remote table (#2045)
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
2025-02-20 11:34:51 +08:00
Lei Xu
1865f7decf fix: support optional nested pydantic model (#2130)
Closes #2129
2025-02-17 20:43:13 -08:00
BubbleCal
a608621476 test: query with dist range and new rows (#2126)
We found a bug where the flat KNN plan node's stats are not in the same order as
the fields in the schema, which causes an error when querying with a distance
range and new unindexed rows.

We've fixed this in Lance, so this adds a test to verify that it works.

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
2025-02-17 12:57:45 +08:00
BubbleCal
00514999ff feat: upgrade lance to 0.23.1-beta.4 (#2121)
this also upgrades object_store to 0.11.0, snafu to 0.8

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
2025-02-16 14:53:26 +08:00
Lance Release
b3b597fef6 Updating package-lock.json 2025-02-13 04:40:10 +00:00
Lance Release
bf17144591 Updating package-lock.json 2025-02-13 04:39:54 +00:00
Lance Release
09e110525f Bump version: 0.16.1-beta.1 → 0.16.1-beta.2 2025-02-13 04:39:38 +00:00
Lance Release
40f0dbb64d Bump version: 0.19.1-beta.1 → 0.19.1-beta.2 2025-02-13 04:39:19 +00:00
BubbleCal
3b19e96ae7 fix: panic when field id doesn't equal to field index (#2116)
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
2025-02-13 12:38:35 +08:00
Will Jones
78a17ad54c chore: improve dev instructions for Python (#2088)
Closes #2042
2025-02-12 14:08:52 -08:00
Lance Release
a8e6b491e2 Updating package-lock.json 2025-02-11 22:05:54 +00:00
Lance Release
cea541ca46 Updating package-lock.json 2025-02-11 20:56:22 +00:00
Lance Release
873ffc1042 Updating package-lock.json 2025-02-11 20:56:05 +00:00
Lance Release
83273ad997 Bump version: 0.16.1-beta.0 → 0.16.1-beta.1 2025-02-11 20:55:43 +00:00
Lance Release
d18d63c69d Bump version: 0.19.1-beta.0 → 0.19.1-beta.1 2025-02-11 20:55:23 +00:00
LuQQiu
c3e865e8d0 fix: fix index out of bound in load indices (#2108)
panicked at 'index out of bounds: the len is 24 but the index is
25':Lancedb/rust/lancedb/src/index/vector.rs:26\n

Calling load_indices() on the old manifest while using the newer manifest to get
column names could result in an index out of bounds error if some columns were
removed in the new version.
This change reduces the possibility of an out-of-bounds access but
does not fully remove it.
It would be better if Lance provided the column names directly, so that no extra
calls are needed to get them, but that requires modifying the public APIs.
2025-02-11 12:54:11 -08:00
Weston Pace
a7755cb313 docs: standardize node example prints (#2080)
Minor cleanup to help debug future CI failures
2025-02-11 08:26:29 -08:00
BubbleCal
3490f3456f chore: upgrade lance to 0.23.1-beta.2 (#2109) 2025-02-11 23:57:56 +08:00
Lance Release
0a1d0693e1 Updating package-lock.json 2025-02-07 20:06:22 +00:00
Lance Release
fd330b4b4b Updating package-lock.json 2025-02-07 19:28:01 +00:00
Lance Release
d4e9fc08e0 Updating package-lock.json 2025-02-07 19:27:44 +00:00
Lance Release
3626f2f5e1 Bump version: 0.16.0 → 0.16.1-beta.0 2025-02-07 19:27:26 +00:00
Lance Release
e64712cfa5 Bump version: 0.19.0 → 0.19.1-beta.0 2025-02-07 19:27:07 +00:00
Wyatt Alt
3e3118f85c feat: update lance dependency to 0.23.1-beta.1 (#2102) 2025-02-07 10:56:01 -08:00
Lance Release
592598a333 Updating package-lock.json 2025-02-07 18:50:53 +00:00
Lance Release
5ad21341c9 Updating package-lock.json 2025-02-07 17:34:04 +00:00
Lance Release
6e08caa091 Updating package-lock.json 2025-02-07 17:33:48 +00:00
Lance Release
7e259d8b0f Bump version: 0.16.0-beta.0 → 0.16.0 2025-02-07 17:33:13 +00:00
Lance Release
e84f747464 Bump version: 0.15.1-beta.3 → 0.16.0-beta.0 2025-02-07 17:33:08 +00:00
Lance Release
998cd43fe6 Bump version: 0.19.0-beta.0 → 0.19.0 2025-02-07 17:32:26 +00:00
Lance Release
4bc7eebe61 Bump version: 0.18.1-beta.4 → 0.19.0-beta.0 2025-02-07 17:32:26 +00:00
Will Jones
2e3b34e79b feat(node): support inserting and upserting subschemas (#2100)
Fixes #2095
Closes #1832
2025-02-07 09:30:18 -08:00
Will Jones
e7574698eb feat: upgrade Lance to 0.23.0 (#2101)
Upstream changelog:
https://github.com/lancedb/lance/releases/tag/v0.23.0
2025-02-07 07:58:07 -08:00
Will Jones
801a9e5f6f feat(python): streaming larger-than-memory writes (#2094)
Makes our preprocessing pipeline do transforms in streaming fashion, so
users can do larger-than-memory writes.

Closes #2082
2025-02-06 16:37:30 -08:00
Weston Pace
4e5fbe6c99 fix: ensure metadata erased from schema call in table provider (#2099)
This also adds a basic unit test for the table provider
2025-02-06 15:30:20 -08:00
Weston Pace
1a449fa49e refactor: rename drop_db / drop_database to drop_all_tables, expose database from connection (#2098)
If we start supporting external catalogs then "drop database" may be
misleading (and not possible). We should be more clear that this is a
utility method to drop all tables. This is also a nice chance for some
consistency cleanup as it was `drop_db` in rust, `drop_database` in
python, and non-existent in typescript.

This PR also adds a public accessor to get the database trait from a
connection.

BREAKING CHANGE: the `drop_database` / `drop_db` methods are now
deprecated.
2025-02-06 13:22:28 -08:00
Weston Pace
6bf742c759 feat: expose table trait (#2097)
Similar to
c269524b2f
this PR reworks and exposes an internal trait (this time
`TableInternal`) to be a public trait. These two PRs together should
make it possible for others to integrate LanceDB on top of other
catalogs.

This PR also adds a basic `TableProvider` implementation for tables,
although some work still needs to be done here (pushdown not yet
enabled).
2025-02-05 18:13:51 -08:00
Ryan Green
ef3093bc23 feat: drop_index() remote implementation (#2093)
Support drop_index operation in remote table.
2025-02-05 10:06:19 -03:30
Will Jones
16851389ea feat: extra headers parameter in client options (#2091)
Closes #1106

Unfortunately, these need to be set at the connection level. I
investigated whether if we let users provide a callback they could use
`AsyncLocalStorage` to access their context. However, it doesn't seem
like NAPI supports this right now. I filed an issue:
https://github.com/napi-rs/napi-rs/issues/2456
2025-02-04 17:26:45 -08:00
Weston Pace
c269524b2f feat!: refactor ConnectionInternal into a Database trait (#2067)
This opens up the door for more custom database implementations than the
two we have today. The biggest change should be invisible:
`ConnectionInternal` has been renamed to `Database`, made public, and
refactored.

However, there are a few breaking changes. `data_storage_version` and
`enable_v2_manifest_paths` have been moved from options on
`create_table` to options for the database which are now set via
`storage_options`.

Before:
```python
db = connect(uri)
tbl = db.create_table("my_table", data, data_storage_version="legacy", enable_v2_manifest_paths=True)
```

After:
```python
db = connect(uri, storage_options={
  "new_table_enable_v2_manifest_paths": "true",
  "new_table_data_storage_version": "legacy"
})
tbl = db.create_table("my_table", data)
```

BREAKING CHANGE: the data_storage_version, enable_v2_manifest_paths
options have moved from options to create_table to storage_options.
BREAKING CHANGE: the use_legacy_format option has been removed,
data_storage_version has replaced it for some time now
2025-02-04 14:35:14 -08:00
Lance Release
f6eef14313 Bump version: 0.18.1-beta.3 → 0.18.1-beta.4 2025-02-04 17:25:52 +00:00
Rob Meng
32716adaa3 chore: bump lance version (#2092) 2025-02-04 12:25:05 -05:00
Lance Release
5e98b7f4c0 Updating package-lock.json 2025-02-01 02:27:43 +00:00
Lance Release
3f2589c11f Updating package-lock.json 2025-02-01 01:22:22 +00:00
Lance Release
e3b99694d6 Updating package-lock.json 2025-02-01 01:22:05 +00:00
Lance Release
9d42dc349c Bump version: 0.15.1-beta.2 → 0.15.1-beta.3 2025-02-01 01:21:28 +00:00
Lance Release
482f1ee1d3 Bump version: 0.18.1-beta.2 → 0.18.1-beta.3 2025-02-01 01:20:49 +00:00
Will Jones
2f39274a66 feat: upgrade lance to 0.23.0-beta.4 (#2089)
Upstream changelog:
https://github.com/lancedb/lance/releases/tag/v0.23.0-beta.4
2025-01-31 17:20:15 -08:00
Will Jones
2fc174f532 docs: add sync/async tabs to quickstart (#2087)
Closes #2033
2025-01-31 15:43:54 -08:00
Will Jones
dba85f4d6f docs: user guide for merge insert (#2083)
Closes #2062
2025-01-31 10:03:21 -08:00
Jeff Simpson
555fa26147 fix(rust): add embedding_registry on open_table (#2086)
# Description

Fix for: https://github.com/lancedb/lancedb/issues/1581

This is the same implementation as
https://github.com/lancedb/lancedb/pull/1781 but with the addition of a
unit test and rustfmt.
2025-01-31 08:48:02 -08:00
Will Jones
e05c0cd87e ci(node): check docs in CI (#2084)
* Make `npm run docs` fail if there are any warnings. This will catch
items missing from the API reference.
* Add a check in our CI to make sure `npm run docs` runs without warnings
and doesn't generate any new files (indicating it might be out-of-date).
* Hide constructors that aren't user facing.
* Remove unused enum `WriteMode`.

Closes #2068
2025-01-30 16:06:06 -08:00
Lance Release
25c17ebf4e Updating package-lock.json 2025-01-30 18:24:59 +00:00
Lance Release
87b12b57dc Updating package-lock.json 2025-01-30 17:33:15 +00:00
Lance Release
3dc9b71914 Updating package-lock.json 2025-01-30 17:32:59 +00:00
Lance Release
2622f34d1a Bump version: 0.15.1-beta.1 → 0.15.1-beta.2 2025-01-30 17:32:33 +00:00
Will Jones
a677a4b651 ci: fix arm64 windows cross compile build (#2081)
* Adds a CI job to check the cross compiled Windows ARM build.
* Didn't replace the test build because we need native build to run
tests. But for some reason (I forget why) we need cross compiled for
nodejs.
* Pinned crunchy to workaround
https://github.com/eira-fransham/crunchy/issues/13

This is needed to fix failure from
https://github.com/lancedb/lancedb/actions/runs/13020773184/job/36320719331
2025-01-30 09:24:20 -08:00
Weston Pace
e6b4f14c1f docs: clarify upper case characters in column names need to be escaped (#2079) 2025-01-29 09:34:43 -08:00
Will Jones
15f8f4d627 ci: check license headers (#2076)
Based on the same workflow in Lance.
2025-01-29 08:27:07 -08:00
Will Jones
6526d6c3b1 ci(rust): caching improvements (up to 2.8x faster builds) (#2075)
Some Rust jobs (such as
[Rust/linux](https://github.com/lancedb/lancedb/actions/runs/13019232960/job/36315830779))
take almost minutes. This can be a bit of a bottleneck.

* Two fixes to make caches more effective
* Check in `Cargo.lock` so that dependencies don't change much between
runs
      * Added a new CI job to validate we can build without a lockfile
* Altered build commands so they don't have contradictory features and
therefore don't trigger multiple builds

Sadly, I don't think there's much to be done for windows-arm64, as much
of the compile time is because the base image is so bare we need to
install the build tools ourselves.
2025-01-29 08:26:45 -08:00
Lance Release
da4d7e3ca7 Updating package-lock.json 2025-01-28 22:32:20 +00:00
Lance Release
8fbadca9aa Updating package-lock.json 2025-01-28 22:32:05 +00:00
Lance Release
29120219cf Bump version: 0.15.1-beta.0 → 0.15.1-beta.1 2025-01-28 22:31:39 +00:00
Lance Release
a9897d9d85 Bump version: 0.18.1-beta.1 → 0.18.1-beta.2 2025-01-28 22:31:14 +00:00
Will Jones
acda7a4589 feat: upgrade lance to v0.23.0-beta.3 (#2074)
This includes several bugfixes for `merge_insert` and null handling in
vector search.

https://github.com/lancedb/lance/releases/tag/v0.23.0-beta.3
2025-01-28 14:00:06 -08:00
Vaibhav
dac0857745 feat: add distance_type() parameter to python sync query builders and metric() as an alias (#2073)
This PR aims to fix #2047 by doing the following things:
- Add a distance_type parameter to the sync query builders of Python
SDK.
- Make metric an alias to distance_type.
2025-01-28 13:59:53 -08:00
Will Jones
0a9e1eab75 fix(node): createTable() should save embeddings, and mergeInsert should use them (#2065)
* `createTable()` now saves embeddings in the schema metadata.
Previously, it would drop them. (`createEmptyTable()` was already tested
and worked.)
* `mergeInsert()` now uses embeddings.

Fixes #2066
2025-01-28 12:38:50 -08:00
V
d999d72c8d docs: pandas example (#2044)
Fix example for section ## From pandas DataFrame
2025-01-24 11:37:47 -08:00
Lance Release
de4720993e Updating package-lock.json 2025-01-23 23:02:20 +00:00
Lance Release
6c14a307e2 Updating package-lock.json 2025-01-23 23:02:03 +00:00
Lance Release
43747278c8 Bump version: 0.15.0 → 0.15.1-beta.0 2025-01-23 23:01:40 +00:00
Lance Release
e5f42a850e Bump version: 0.18.1-beta.0 → 0.18.1-beta.1 2025-01-23 23:01:13 +00:00
Will Jones
7920ecf66e ci(python): stop using deprecated 2_24 manylinux for arm (#2064)
Based on changes made in Lance:

* https://github.com/lancedb/lance/pull/3409
* https://github.com/lancedb/lance/pull/3411
2025-01-23 15:00:34 -08:00
Will Jones
28e1b70e4b fix(python): preserve original distance and score in hybrid queries (#2061)
Fixes #2031

When we do hybrid search, we normalize the scores. We do this
calculation in-place, because the Rerankers expect the `_distance` and
`_score` columns to be the normalized ones. So I've changed the logic so
that we restore the original distance and scores by matching on row ids.
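
A minimal sketch of that approach, under assumptions: rows are plain dicts, min-max normalization is used, and the helper names are invented for this example. It is not the code from the PR.

```python
def normalize(values):
    """Min-max normalize a list of numbers into [0, 1]."""
    if not values:
        return []
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def rerank_preserving_originals(rows, score_key="_score"):
    """Sort by normalized score, then restore the original values by row id."""
    originals = {row["_rowid"]: row[score_key] for row in rows}
    normalized = normalize([row[score_key] for row in rows])
    for row, value in zip(rows, normalized):
        row[score_key] = value  # in place, as a reranker would expect
    reranked = sorted(rows, key=lambda r: r[score_key], reverse=True)
    for row in reranked:
        row[score_key] = originals[row["_rowid"]]  # match on row id
    return reranked


rows = [
    {"_rowid": 1, "_score": 2.0},
    {"_rowid": 2, "_score": 8.0},
    {"_rowid": 3, "_score": 5.0},
]
reranked = rerank_preserving_originals(rows)
print([(r["_rowid"], r["_score"]) for r in reranked])
# [(2, 8.0), (3, 5.0), (1, 2.0)]
```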
2025-01-23 13:54:26 -08:00
Will Jones
52b79d2b1e feat: upgrade lance to v0.23.0-beta.2 (#2063)
Fixes https://github.com/lancedb/lancedb/issues/2043
2025-01-23 13:51:30 -08:00
Bert
c05d45150d docs: clarify the arguments for replace_field_metadata (#2053)
When calling `replace_field_metadata` we pass in an iter of tuples
`(u32, HashMap<String, String>)`.

That `u32` needs to be the field id from the lance schema

7f60aa0a87/rust/lance-core/src/datatypes/field.rs (L123)

This can sometimes be different than the index of the field in the arrow
schema (e.g. if fields have been dropped).

This PR adds docs that try to clarify what that argument should be, as
well as corrects the usage in the test (which was improperly passing the
index of the arrow schema).
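
The difference between a field id and a positional index can be shown with a toy schema. This is a simplified model that assumes ids are assigned once and not reused after a drop; the dict layout and helper names are made up for illustration.

```python
# Toy schema: each field keeps the id it was assigned at creation time.
fields = [
    {"id": 0, "name": "id"},
    {"id": 1, "name": "tmp"},
    {"id": 2, "name": "vector"},
]

# Dropping "tmp" shifts positions but leaves the remaining ids untouched.
remaining = [f for f in fields if f["name"] != "tmp"]


def field_index(schema, name):
    """Position of the field in the (Arrow-like) schema."""
    return [f["name"] for f in schema].index(name)


def field_id(schema, name):
    """Stable id of the field, independent of its position."""
    for f in schema:
        if f["name"] == name:
            return f["id"]
    raise KeyError(name)


print(field_index(remaining, "vector"))  # 1
print(field_id(remaining, "vector"))     # 2
```

Passing the index (1) where the id (2) is expected would target the wrong field, or no field at all.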
2025-01-23 08:52:27 -05:00
BubbleCal
48ed3bb544 chore: replace the util to lance's (#2052)
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
2025-01-23 11:04:37 +08:00
Will Jones
bcfc93cc88 fix(python): various fixes for async query builders (#2048)
This includes several improvements and fixes to the Python Async query
builders:

1. The API reference docs show all the methods for each builder
2. The hybrid query builder now has all the same setter methods as the
vector search one, so you can now set things like `.distance_type()` on
a hybrid query.
3. Re-rankers are now properly hooked up and tested for FTS and vector
search. Previously the re-rankers were accidentally bypassed in unit
tests, because the builders overrode `.to_arrow()`, but the unit test
called `.to_batches()` which was only defined in the base class. Now all
builders implement `.to_batches()` and leave `.to_arrow()` to the base
class.
4. The `AsyncQueryBase` and `AsyncVectorQueryBase` setter methods now
return `Self`, which provides the appropriate subclass as the type hint
return value. Previously, `AsyncQueryBase` had them all hard-coded to
`AsyncQuery`, which was unfortunate. (This required bringing in
`typing-extensions` for older Python version, but I think it's worth
it.)
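
Item 4 can be illustrated with a short sketch of the `Self` return annotation. The classes here are hypothetical stand-ins, not the real builders. The import is guarded by `TYPE_CHECKING` so the snippet runs without `typing-extensions` installed; type checkers still see `Self`.

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Only needed by type checkers; `typing.Self` also works on Python 3.11+.
    from typing_extensions import Self


class ToyQueryBase:
    def __init__(self) -> None:
        self._limit = None

    def limit(self, n: int) -> Self:
        """Setter defined once on the base class."""
        self._limit = n
        return self


class ToyVectorQuery(ToyQueryBase):
    def __init__(self) -> None:
        super().__init__()
        self._distance_type = "l2"

    def distance_type(self, name: str) -> Self:
        self._distance_type = name
        return self


# With `-> Self`, a type checker infers ToyVectorQuery after .limit(),
# so the subclass-only .distance_type() call type-checks in a chain.
query = ToyVectorQuery().limit(5).distance_type("cosine")
print(type(query).__name__)  # ToyVectorQuery
```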
2025-01-20 16:14:34 -08:00
BubbleCal
214d0debf5 docs: claim LanceDB supports float16/float32/float64 for multivector (#2040) 2025-01-21 07:04:15 +08:00
Will Jones
f059372137 feat: add drop_index() method (#2039)
Closes #1665
2025-01-20 10:08:51 -08:00
Lance Release
3dc1803c07 Bump version: 0.18.0 → 0.18.1-beta.0 2025-01-17 04:37:23 +00:00
BubbleCal
d0501f65f1 fix: linear reranker applies wrong score to combine (#2035)
related to #2014
this fixes:
- linear reranker may lose some results if the merging consumes all
vector results earlier than fts results
- linear reranker inverts the fts score but only vector distance can be
inverted
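
Both points can be illustrated with a generic linear combination. This is a sketch under stated assumptions, not the reranker's actual formula: inputs are dicts keyed by row id, distances and scores are already normalized to [0, 1], and the `weight` and `missing` parameters are invented for the example.

```python
def linear_combine(vector_hits, fts_hits, weight=0.5, missing=0.0):
    """Combine vector distances (lower is better) with FTS scores (higher is better).

    Only the vector distance is inverted into a similarity; the FTS score is
    used as-is. Iterating over the union of row ids keeps every result, even
    when one list runs out before the other.
    """
    combined = {}
    for row_id in set(vector_hits) | set(fts_hits):
        if row_id in vector_hits:
            vector_similarity = 1.0 - vector_hits[row_id]
        else:
            vector_similarity = missing
        fts_score = fts_hits.get(row_id, missing)
        combined[row_id] = weight * vector_similarity + (1 - weight) * fts_score
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


vector_hits = {1: 0.1, 2: 0.4}  # row id -> normalized distance
fts_hits = {2: 0.9, 3: 0.8}     # row id -> normalized score
ranked = linear_combine(vector_hits, fts_hits)
print([row_id for row_id, _ in ranked])  # [2, 1, 3]
```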

---------

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
2025-01-17 11:33:48 +08:00
Bert
4703cc6894 chore: upgrade lance to v0.22.1-beta.3 (#2038) 2025-01-16 12:42:42 -05:00
BubbleCal
493f9ce467 fix: can't infer the vector column for multivector (#2026)
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
2025-01-16 14:08:04 +08:00
Weston Pace
5c759505b8 feat: upgrade lance 0.22.1b1 (#2029)
Now the version actually exists :)
2025-01-15 07:37:37 -08:00
BubbleCal
bb6a39727e fix: missing distance type for auto index on RemoteTable (#2027)
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
2025-01-15 20:28:55 +08:00
BubbleCal
d57bed90e5 docs: add missing example code (#2025) 2025-01-14 21:17:05 -08:00
BubbleCal
648327e90c docs: show how to pack bits for binary vector (#2020)
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
2025-01-14 09:00:57 -08:00
Lance Release
6c7e81ee57 Updating package-lock.json 2025-01-14 02:14:37 +00:00
Lance Release
905e9d4738 Updating package-lock.json 2025-01-14 01:03:49 +00:00
Lance Release
38642e349c Updating package-lock.json 2025-01-14 01:03:33 +00:00
Lance Release
6879861ea8 Bump version: 0.15.0-beta.1 → 0.15.0 2025-01-14 01:03:04 +00:00
Lance Release
88325e488e Bump version: 0.15.0-beta.0 → 0.15.0-beta.1 2025-01-14 01:02:59 +00:00
Lance Release
995bd9bf37 Bump version: 0.18.0-beta.1 → 0.18.0 2025-01-14 01:02:26 +00:00
Lance Release
36cc06697f Bump version: 0.18.0-beta.0 → 0.18.0-beta.1 2025-01-14 01:02:25 +00:00
Will Jones
35da464591 ci: fix stable check (#2019) 2025-01-13 17:01:54 -08:00
Will Jones
31f9c30ffb chore: fix test of error message (#2018)
Addresses failure on `main`:
https://github.com/lancedb/lancedb/actions/runs/12757756657/job/35558683317
2025-01-13 15:36:46 -08:00
Will Jones
92dcf24b0c feat: upgrade Lance to v0.22.0 (#2017)
Upstream changelog:
https://github.com/lancedb/lance/releases/tag/v0.22.0
2025-01-13 15:06:01 -08:00
Will Jones
6b0adba2d9 chore: add deprecation warning to vectordb (#2003) 2025-01-13 14:53:12 -08:00
BubbleCal
66cbf6b6c5 feat: support multivector type (#2005)
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
2025-01-13 14:10:40 -08:00
Keming
ce9506db71 docs(hnsw): fix markdown list style (#2015) 2025-01-13 08:53:13 -08:00
Prashant Dixit
b66cd943a7 fix: broken voyageai embedding API (#2013)
This PR fixes the broken Embedding API for Voyageai.
2025-01-13 08:52:38 -08:00
Weston Pace
d8d11f48e7 feat: upgrade to lance 0.22.0b1 (#2011) 2025-01-10 12:51:52 -08:00
Lance Release
7ec5df3022 Updating package-lock.json 2025-01-10 19:58:10 +00:00
Lance Release
b17304172c Updating package-lock.json 2025-01-10 19:02:31 +00:00
Lance Release
fbe5408434 Updating package-lock.json 2025-01-10 19:02:15 +00:00
Lance Release
3f3f845c5a Bump version: 0.14.2-beta.0 → 0.15.0-beta.0 2025-01-10 19:01:47 +00:00
Lance Release
fbffe532a8 Bump version: 0.17.2-beta.2 → 0.18.0-beta.0 2025-01-10 19:01:20 +00:00
Josef Gugglberger
55ffc96e56 docs: update storage.md, fix Azure Sync connect example (#2010)
In the sync code example there was also an `await`.


![image](https://github.com/user-attachments/assets/4e1a1bd9-f2fb-4dbe-a9a6-1384ab63edbb)
2025-01-10 09:01:19 -08:00
Mr. Doge
998c5f3f74 ci: add dbghelp.lib to sysroot-aarch64-pc-windows-msvc.sh (#1975) (#2008)
successful runs:
https://github.com/FuPeiJiang/lancedb/actions/runs/12698662005
2025-01-09 14:24:09 -08:00
Will Jones
6eacae18c4 test: fix test failure from merge (#2007) 2025-01-09 11:27:24 -08:00
Bert
d3ea75cc2b feat: expose dataset config (#2004)
Expose methods on NativeTable for updating schema metadata and dataset
config & getting the dataset config via the manifest.
2025-01-08 21:13:18 -05:00
Bert
f4afe456e8 feat!: change default from postfiltering to prefiltering for sync python (#2000)
BREAKING CHANGE: prefiltering is now the default in the synchronous
python SDK

resolves: #1872
2025-01-08 19:13:58 -05:00
Renato Marroquin
ea5c2266b8 feat(python): support .rerank() on non-hybrid queries in Async API (WIP) (#1972)
Fixes https://github.com/lancedb/lancedb/issues/1950

---------

Co-authored-by: Renato Marroquin <renato.marroquin@oracle.com>
2025-01-08 16:42:47 -05:00
Will Jones
c557e77f09 feat(python)!: support inserting and upserting subschemas (#1965)
BREAKING CHANGE: For a field "vector", list of integers will now be
converted to binary (uint8) vectors instead of f32 vectors. Use float
values instead for f32 vectors.

* Adds proper support for inserting and upserting subsets of the full
schema. I thought I had previously implemented this in #1827, but it
turns out I had not tested carefully enough.
* Refactors `_sanitize_data` and other utility functions to be simpler
and not require `numpy` or `combine_chunks()`.
* Added a new suite of unit tests to validate sanitization utilities.

## Examples

```python
import pandas as pd
import lancedb

db = lancedb.connect("memory://demo")
initial_data = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [4, 5, 6],
    "c": [7, 8, 9]
})
table = db.create_table("demo", initial_data)

# Insert a subschema
new_data = pd.DataFrame({"a": [10, 11]})
table.add(new_data)
table.to_pandas()
```
```
    a    b    c
0   1  4.0  7.0
1   2  5.0  8.0
2   3  6.0  9.0
3  10  NaN  NaN
4  11  NaN  NaN
```


```python
# Upsert a subschema
upsert_data = pd.DataFrame({
    "a": [3, 10, 15],
    "b": [6, 7, 8],
})
table.merge_insert(on="a").when_matched_update_all().when_not_matched_insert_all().execute(upsert_data)
table.to_pandas()
```
```
    a    b    c
0   1  4.0  7.0
1   2  5.0  8.0
2   3  6.0  9.0
3  10  7.0  NaN
4  11  NaN  NaN
5  15  8.0  NaN
```
2025-01-08 10:11:10 -08:00
BubbleCal
3c0a64be8f feat: support distance range in queries (#1999)
this also updates the docs

---------

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
2025-01-08 11:03:27 +08:00
Will Jones
0e496ed3b5 docs: contributing guide (#1970)
* Adds basic contributing guides.
* Simplifies Python development with a Makefile.
2025-01-07 15:11:16 -08:00
QianZhu
17c9e9afea docs: add async examples to doc (#1941)
- added sync and async tabs for python examples
- moved python code to tests/docs

---------

Co-authored-by: Will Jones <willjones127@gmail.com>
2025-01-07 15:10:25 -08:00
Wyatt Alt
0b45ef93c0 docs: assorted copyedits (#1998)
This includes a handful of minor edits I made while reading the docs. In
addition to a few spelling fixes,
* standardize on "rerank" over "re-rank" in prose
* terminate sentences with periods or colons as appropriate
* replace some usage of dashes with colons, such as in "Try it yourself
- <link>"

All changes are surface-level. No changes to semantics or structure.

---------

Co-authored-by: Will Jones <willjones127@gmail.com>
2025-01-06 15:04:48 -08:00
Gagan Bhullar
b474f98049 feat(python): flatten in AsyncQuery (#1967)
PR fixes #1949

---------

Co-authored-by: Will Jones <willjones127@gmail.com>
2025-01-06 10:52:03 -08:00
Takahiro Ebato
2c05ffed52 feat(python): add to_polars to AsyncQueryBase (#1986)
Fixes https://github.com/lancedb/lancedb/issues/1952

Added `to_polars` method to `AsyncQueryBase`.
2025-01-06 09:35:28 -08:00
Will Jones
8b31540b21 ci: prevent stable release with preview lance (#1995)
Accidentally referenced a preview release in our stable release of
LanceDB. This adds a CI check to prevent that.
2025-01-06 08:54:14 -08:00
Lance Release
ba844318f8 Updating package-lock.json 2025-01-06 06:26:41 +00:00
Lance Release
f007b76153 Updating package-lock.json 2025-01-06 05:35:28 +00:00
Lance Release
5d8d258f59 Updating package-lock.json 2025-01-06 05:35:13 +00:00
Lance Release
4172140f74 Bump version: 0.14.1 → 0.14.2-beta.0 2025-01-06 05:34:52 +00:00
Lance Release
a27c5cf12b Bump version: 0.17.2-beta.1 → 0.17.2-beta.2 2025-01-06 05:34:27 +00:00
BubbleCal
f4dea72cc5 feat: support vector search with distance thresholds (#1993)
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
2025-01-06 13:23:39 +08:00
Lei Xu
f76c4a5ce1 chore: add pyright static type checking and fix some of the table interface (#1996)
* Enable `pyright` in the project
* Fixed some pyright typing errors in `table.py`
2025-01-04 15:24:58 -08:00
ahaapple
164ce397c2 docs: fix full-text search (Native FTS) TypeScript doc error (#1992)
Fix

```
Cannot find name 'queryType'.ts(2304)
any
```
2025-01-03 13:36:10 -05:00
BubbleCal
445a312667 fix: selecting columns failed on FTS and hybrid search (#1991)
It reported the error `AttributeError: 'builtins.FTSQuery' object has no
attribute 'select_columns'`
because the `select_columns` method was missing in Rust.

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
2025-01-03 13:08:12 +08:00
Lance Release
92d845fa72 Bump version: 0.17.2-beta.0 → 0.17.2-beta.1 2024-12-31 23:36:18 +00:00
Lei Xu
397813f6a4 chore: bump pylance to 0.21.1b1 (#1989) 2024-12-31 15:34:27 -08:00
Lei Xu
50c30c5d34 chore(python): fix typo of the synchronized checkout API (#1988) 2024-12-30 18:54:31 -08:00
Bert
c9f248b058 feat: add hybrid search to node and rust SDKs (#1940)
Support hybrid search in both rust and node SDKs.

- Adds a new rerankers package to rust LanceDB, with the implementation
of the default RRF reranker
- Adds a new hybrid package to lancedb, with some helper methods related
to hybrid search such as normalizing scores and converting score column
to rank columns
- Adds capability to LanceDB VectorQuery to perform hybrid search if it
has both a nearest vector and full text search parameters.
- Adds wrappers for reranker implementations to nodejs SDK.

Additional rerankers will be added in followup PRs
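
For readers unfamiliar with reciprocal rank fusion (RRF), the general technique can be sketched as follows. This is a generic illustration in Python, not the Rust implementation from this PR; the function name and the constant `k=60` (a commonly used value) are assumptions.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of row ids into one ranking.

    Each row scores sum(1 / (k + rank)) over the lists it appears in,
    where rank starts at 1 for the best hit in a list.
    """
    scores = {}
    for ranking in rankings:
        for rank, row_id in enumerate(ranking, start=1):
            scores[row_id] = scores.get(row_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda row_id: scores[row_id], reverse=True)


vector_ranking = [10, 20, 30]  # row ids ordered by vector distance
fts_ranking = [20, 40, 10]     # row ids ordered by full-text score
fused = reciprocal_rank_fusion([vector_ranking, fts_ranking])
print(fused)  # [20, 10, 40, 30]
```

Because RRF only uses ranks, it does not need the distance and score columns to be on comparable scales.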

https://github.com/lancedb/lancedb/issues/1921

---
Notes about how the rust rerankers are wrapped for calling from JS:

I wanted to keep the core reranker logic, and the invocation of the
reranker by the query code, in Rust. This aligns with the philosophy of
the new node SDK where it's just a thin wrapper around Rust. However, I
also wanted to have support for users who want to add custom rerankers
written in Javascript.

When we add a reranker to the query from Javascript, it adds a special
Rust reranker that has a callback to the Javascript code (which could
then turn around and call an underlying Rust reranker implementation if
desired). This adds a bit of complexity, but overall I think it moves us
in the right direction of having the majority of the query logic in the
underlying Rust SDK while keeping the option open to support custom
Javascript Rerankers.
2024-12-30 09:03:41 -05:00
Renato Marroquin
0cb6da6b7e docs: add new indexes to python docs (#1945)
closes issue #1855

Co-authored-by: Renato Marroquin <renato.marroquin@oracle.com>
2024-12-28 15:35:10 -08:00
BubbleCal
aec8332eb5 chore: add dynamic = ["version"] to pass build check (#1977)
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
2024-12-28 10:45:23 -08:00
Lance Release
46061070e6 Updating package-lock.json 2024-12-26 07:40:12 +00:00
412 changed files with 36788 additions and 7902 deletions

View File

@@ -1,5 +1,5 @@
 [tool.bumpversion]
-current_version = "0.14.1"
+current_version = "0.19.1-beta.1"
 parse = """(?x)
 (?P<major>0|[1-9]\\d*)\\.
 (?P<minor>0|[1-9]\\d*)\\.
@@ -87,26 +87,11 @@ glob = "node/package.json"
 replace = "\"@lancedb/vectordb-linux-x64-gnu\": \"{new_version}\""
 search = "\"@lancedb/vectordb-linux-x64-gnu\": \"{current_version}\""
-[[tool.bumpversion.files]]
-glob = "node/package.json"
-replace = "\"@lancedb/vectordb-linux-arm64-musl\": \"{new_version}\""
-search = "\"@lancedb/vectordb-linux-arm64-musl\": \"{current_version}\""
-[[tool.bumpversion.files]]
-glob = "node/package.json"
-replace = "\"@lancedb/vectordb-linux-x64-musl\": \"{new_version}\""
-search = "\"@lancedb/vectordb-linux-x64-musl\": \"{current_version}\""
 [[tool.bumpversion.files]]
 glob = "node/package.json"
 replace = "\"@lancedb/vectordb-win32-x64-msvc\": \"{new_version}\""
 search = "\"@lancedb/vectordb-win32-x64-msvc\": \"{current_version}\""
-[[tool.bumpversion.files]]
-glob = "node/package.json"
-replace = "\"@lancedb/vectordb-win32-arm64-msvc\": \"{new_version}\""
-search = "\"@lancedb/vectordb-win32-arm64-msvc\": \"{current_version}\""
 # Cargo files
 # ------------
 [[tool.bumpversion.files]]


@@ -34,6 +34,10 @@ rustflags = ["-C", "target-cpu=haswell", "-C", "target-feature=+avx2,+fma,+f16c"
[target.x86_64-unknown-linux-musl] [target.x86_64-unknown-linux-musl]
rustflags = ["-C", "target-cpu=haswell", "-C", "target-feature=-crt-static,+avx2,+fma,+f16c"] rustflags = ["-C", "target-cpu=haswell", "-C", "target-feature=-crt-static,+avx2,+fma,+f16c"]
[target.aarch64-unknown-linux-musl]
linker = "aarch64-linux-musl-gcc"
rustflags = ["-C", "target-feature=-crt-static"]
[target.aarch64-apple-darwin] [target.aarch64-apple-darwin]
rustflags = ["-C", "target-cpu=apple-m1", "-C", "target-feature=+neon,+fp16,+fhm,+dotprod"] rustflags = ["-C", "target-cpu=apple-m1", "-C", "target-feature=+neon,+fp16,+fhm,+dotprod"]


@@ -36,8 +36,7 @@ runs:
args: ${{ inputs.args }} args: ${{ inputs.args }}
before-script-linux: | before-script-linux: |
set -e set -e
yum install -y openssl-devel \ curl -L https://github.com/protocolbuffers/protobuf/releases/download/v24.4/protoc-24.4-linux-$(uname -m).zip > /tmp/protoc.zip \
&& curl -L https://github.com/protocolbuffers/protobuf/releases/download/v24.4/protoc-24.4-linux-$(uname -m).zip > /tmp/protoc.zip \
&& unzip /tmp/protoc.zip -d /usr/local \ && unzip /tmp/protoc.zip -d /usr/local \
&& rm /tmp/protoc.zip && rm /tmp/protoc.zip
- name: Build Arm Manylinux Wheel - name: Build Arm Manylinux Wheel
@@ -52,12 +51,7 @@ runs:
args: ${{ inputs.args }} args: ${{ inputs.args }}
before-script-linux: | before-script-linux: |
set -e set -e
apt install -y unzip yum install -y clang \
if [ $(uname -m) = "x86_64" ]; then && curl -L https://github.com/protocolbuffers/protobuf/releases/download/v24.4/protoc-24.4-linux-aarch_64.zip > /tmp/protoc.zip \
PROTOC_ARCH="x86_64"
else
PROTOC_ARCH="aarch_64"
fi
curl -L https://github.com/protocolbuffers/protobuf/releases/download/v24.4/protoc-24.4-linux-$PROTOC_ARCH.zip > /tmp/protoc.zip \
&& unzip /tmp/protoc.zip -d /usr/local \ && unzip /tmp/protoc.zip -d /usr/local \
&& rm /tmp/protoc.zip && rm /tmp/protoc.zip


@@ -20,7 +20,7 @@ runs:
uses: PyO3/maturin-action@v1 uses: PyO3/maturin-action@v1
with: with:
command: build command: build
# TODO: pass through interpreter
args: ${{ inputs.args }} args: ${{ inputs.args }}
docker-options: "-e PIP_EXTRA_INDEX_URL=https://pypi.fury.io/lancedb/" docker-options: "-e PIP_EXTRA_INDEX_URL=https://pypi.fury.io/lancedb/"
working-directory: python working-directory: python
interpreter: 3.${{ inputs.python-minor-version }}


@@ -28,7 +28,7 @@ runs:
args: ${{ inputs.args }} args: ${{ inputs.args }}
docker-options: "-e PIP_EXTRA_INDEX_URL=https://pypi.fury.io/lancedb/" docker-options: "-e PIP_EXTRA_INDEX_URL=https://pypi.fury.io/lancedb/"
working-directory: python working-directory: python
- uses: actions/upload-artifact@v3 - uses: actions/upload-artifact@v4
with: with:
name: windows-wheels name: windows-wheels
path: python\target\wheels path: python\target\wheels


@@ -18,17 +18,24 @@ concurrency:
group: "pages" group: "pages"
cancel-in-progress: true cancel-in-progress: true
env:
# This reduces the disk space needed for the build
RUSTFLAGS: "-C debuginfo=0"
# according to: https://matklad.github.io/2021/09/04/fast-rust-builds.html
# CI builds are faster with incremental disabled.
CARGO_INCREMENTAL: "0"
jobs: jobs:
# Single deploy job since we're just deploying # Single deploy job since we're just deploying
build: build:
environment: environment:
name: github-pages name: github-pages
url: ${{ steps.deployment.outputs.page_url }} url: ${{ steps.deployment.outputs.page_url }}
runs-on: buildjet-8vcpu-ubuntu-2204 runs-on: ubuntu-24.04
steps: steps:
- name: Checkout - name: Checkout
uses: actions/checkout@v4 uses: actions/checkout@v4
- name: Install dependecies needed for ubuntu - name: Install dependencies needed for ubuntu
run: | run: |
sudo apt install -y protobuf-compiler libssl-dev sudo apt install -y protobuf-compiler libssl-dev
rustup update && rustup default rustup update && rustup default
@@ -38,6 +45,7 @@ jobs:
python-version: "3.10" python-version: "3.10"
cache: "pip" cache: "pip"
cache-dependency-path: "docs/requirements.txt" cache-dependency-path: "docs/requirements.txt"
- uses: Swatinem/rust-cache@v2
- name: Build Python - name: Build Python
working-directory: python working-directory: python
run: | run: |
@@ -49,7 +57,6 @@ jobs:
node-version: 20 node-version: 20
cache: 'npm' cache: 'npm'
cache-dependency-path: node/package-lock.json cache-dependency-path: node/package-lock.json
- uses: Swatinem/rust-cache@v2
- name: Install node dependencies - name: Install node dependencies
working-directory: node working-directory: node
run: | run: |


@@ -43,7 +43,7 @@ jobs:
- uses: Swatinem/rust-cache@v2 - uses: Swatinem/rust-cache@v2
- uses: actions-rust-lang/setup-rust-toolchain@v1 - uses: actions-rust-lang/setup-rust-toolchain@v1
with: with:
toolchain: "1.79.0" toolchain: "1.81.0"
cache-workspaces: "./java/core/lancedb-jni" cache-workspaces: "./java/core/lancedb-jni"
# Disable full debug symbol generation to speed up CI build and keep memory down # Disable full debug symbol generation to speed up CI build and keep memory down
# "1" means line tables only, which is useful for panic tracebacks. # "1" means line tables only, which is useful for panic tracebacks.
@@ -97,7 +97,7 @@ jobs:
- name: Dry run - name: Dry run
if: github.event_name == 'pull_request' if: github.event_name == 'pull_request'
run: | run: |
mvn --batch-mode -DskipTests package mvn --batch-mode -DskipTests -Drust.release.build=true package
- name: Set github - name: Set github
run: | run: |
git config --global user.email "LanceDB Github Runner" git config --global user.email "LanceDB Github Runner"
@@ -108,7 +108,7 @@ jobs:
echo "use-agent" >> ~/.gnupg/gpg.conf echo "use-agent" >> ~/.gnupg/gpg.conf
echo "pinentry-mode loopback" >> ~/.gnupg/gpg.conf echo "pinentry-mode loopback" >> ~/.gnupg/gpg.conf
export GPG_TTY=$(tty) export GPG_TTY=$(tty)
mvn --batch-mode -DskipTests -DpushChanges=false -Dgpg.passphrase=${{ secrets.GPG_PASSPHRASE }} deploy -P deploy-to-ossrh mvn --batch-mode -DskipTests -Drust.release.build=true -DpushChanges=false -Dgpg.passphrase=${{ secrets.GPG_PASSPHRASE }} deploy -P deploy-to-ossrh
env: env:
SONATYPE_USER: ${{ secrets.SONATYPE_USER }} SONATYPE_USER: ${{ secrets.SONATYPE_USER }}
SONATYPE_TOKEN: ${{ secrets.SONATYPE_TOKEN }} SONATYPE_TOKEN: ${{ secrets.SONATYPE_TOKEN }}


@@ -0,0 +1,31 @@
name: Check license headers
on:
push:
branches:
- main
pull_request:
paths:
- rust/**
- python/**
- nodejs/**
- java/**
- .github/workflows/license-header-check.yml
jobs:
check-licenses:
runs-on: ubuntu-latest
steps:
- name: Check out code
uses: actions/checkout@v4
- name: Install license-header-checker
working-directory: /tmp
run: |
curl -s https://raw.githubusercontent.com/lluissm/license-header-checker/master/install.sh | bash
mv /tmp/bin/license-header-checker /usr/local/bin/
- name: Check license headers (rust)
run: license-header-checker -a -v ./rust/license_header.txt ./ rs && [[ -z `git status -s` ]]
- name: Check license headers (python)
run: license-header-checker -a -v ./python/license_header.txt python py && [[ -z `git status -s` ]]
- name: Check license headers (typescript)
run: license-header-checker -a -v ./nodejs/license_header.txt nodejs ts && [[ -z `git status -s` ]]
- name: Check license headers (java)
run: license-header-checker -a -v ./nodejs/license_header.txt java java && [[ -z `git status -s` ]]


@@ -43,7 +43,7 @@ on:
jobs: jobs:
make-release: make-release:
# Creates tag and GH release. The GH release will trigger the build and release jobs. # Creates tag and GH release. The GH release will trigger the build and release jobs.
runs-on: ubuntu-latest runs-on: ubuntu-24.04
permissions: permissions:
contents: write contents: write
steps: steps:
@@ -57,15 +57,14 @@ jobs:
# trigger any workflows watching for new tags. See: # trigger any workflows watching for new tags. See:
# https://docs.github.com/en/actions/using-workflows/triggering-a-workflow#triggering-a-workflow-from-a-workflow # https://docs.github.com/en/actions/using-workflows/triggering-a-workflow#triggering-a-workflow-from-a-workflow
token: ${{ secrets.LANCEDB_RELEASE_TOKEN }} token: ${{ secrets.LANCEDB_RELEASE_TOKEN }}
- name: Validate Lance dependency is at stable version
if: ${{ inputs.type == 'stable' }}
run: python ci/validate_stable_lance.py
- name: Set git configs for bumpversion - name: Set git configs for bumpversion
shell: bash shell: bash
run: | run: |
git config user.name 'Lance Release' git config user.name 'Lance Release'
git config user.email 'lance-dev@lancedb.com' git config user.email 'lance-dev@lancedb.com'
- name: Set up Python 3.11
uses: actions/setup-python@v5
with:
python-version: "3.11"
- name: Bump Python version - name: Bump Python version
if: ${{ inputs.python }} if: ${{ inputs.python }}
working-directory: python working-directory: python


@@ -106,6 +106,18 @@ jobs:
python ci/mock_openai.py & python ci/mock_openai.py &
cd nodejs/examples cd nodejs/examples
npm test npm test
- name: Check docs
run: |
# We run this as part of the job because the binary needs to be built
# first to export the types of the native code.
set -e
npm ci
npm run docs
if ! git diff --exit-code; then
echo "Docs need to be updated"
echo "Run 'npm run docs', fix any warnings, and commit the changes."
exit 1
fi
macos: macos:
timeout-minutes: 30 timeout-minutes: 30
runs-on: "macos-14" runs-on: "macos-14"

File diff suppressed because it is too large


@@ -4,6 +4,11 @@ on:
push: push:
tags: tags:
- 'python-v*' - 'python-v*'
pull_request:
# This should trigger a dry run (we skip the final publish step)
paths:
- .github/workflows/pypi-publish.yml
- Cargo.toml # Change in dependency frequently breaks builds
jobs: jobs:
linux: linux:
@@ -15,15 +20,21 @@ jobs:
- platform: x86_64 - platform: x86_64
manylinux: "2_17" manylinux: "2_17"
extra_args: "" extra_args: ""
runner: ubuntu-22.04
- platform: x86_64 - platform: x86_64
manylinux: "2_28" manylinux: "2_28"
extra_args: "--features fp16kernels" extra_args: "--features fp16kernels"
runner: ubuntu-22.04
- platform: aarch64 - platform: aarch64
manylinux: "2_24" manylinux: "2_17"
extra_args: "" extra_args: ""
# We don't build fp16 kernels for aarch64, because it uses # For successful fat LTO builds, we need a large runner to avoid OOM errors.
# cross compilation image, which doesn't have a new enough compiler. runner: ubuntu-2404-8x-arm64
runs-on: "ubuntu-22.04" - platform: aarch64
manylinux: "2_28"
extra_args: "--features fp16kernels"
runner: ubuntu-2404-8x-arm64
runs-on: ${{ matrix.config.runner }}
steps: steps:
- uses: actions/checkout@v4 - uses: actions/checkout@v4
with: with:
@@ -40,6 +51,7 @@ jobs:
arm-build: ${{ matrix.config.platform == 'aarch64' }} arm-build: ${{ matrix.config.platform == 'aarch64' }}
manylinux: ${{ matrix.config.manylinux }} manylinux: ${{ matrix.config.manylinux }}
- uses: ./.github/workflows/upload_wheel - uses: ./.github/workflows/upload_wheel
if: startsWith(github.ref, 'refs/tags/python-v')
with: with:
pypi_token: ${{ secrets.LANCEDB_PYPI_API_TOKEN }} pypi_token: ${{ secrets.LANCEDB_PYPI_API_TOKEN }}
fury_token: ${{ secrets.FURY_TOKEN }} fury_token: ${{ secrets.FURY_TOKEN }}
@@ -69,6 +81,7 @@ jobs:
python-minor-version: 8 python-minor-version: 8
args: "--release --strip --target ${{ matrix.config.target }} --features fp16kernels" args: "--release --strip --target ${{ matrix.config.target }} --features fp16kernels"
- uses: ./.github/workflows/upload_wheel - uses: ./.github/workflows/upload_wheel
if: startsWith(github.ref, 'refs/tags/python-v')
with: with:
pypi_token: ${{ secrets.LANCEDB_PYPI_API_TOKEN }} pypi_token: ${{ secrets.LANCEDB_PYPI_API_TOKEN }}
fury_token: ${{ secrets.FURY_TOKEN }} fury_token: ${{ secrets.FURY_TOKEN }}
@@ -90,10 +103,12 @@ jobs:
args: "--release --strip" args: "--release --strip"
vcpkg_token: ${{ secrets.VCPKG_GITHUB_PACKAGES }} vcpkg_token: ${{ secrets.VCPKG_GITHUB_PACKAGES }}
- uses: ./.github/workflows/upload_wheel - uses: ./.github/workflows/upload_wheel
if: startsWith(github.ref, 'refs/tags/python-v')
with: with:
pypi_token: ${{ secrets.LANCEDB_PYPI_API_TOKEN }} pypi_token: ${{ secrets.LANCEDB_PYPI_API_TOKEN }}
fury_token: ${{ secrets.FURY_TOKEN }} fury_token: ${{ secrets.FURY_TOKEN }}
gh-release: gh-release:
if: startsWith(github.ref, 'refs/tags/python-v')
runs-on: ubuntu-latest runs-on: ubuntu-latest
permissions: permissions:
contents: write contents: write


@@ -13,6 +13,11 @@ concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }} group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
cancel-in-progress: true cancel-in-progress: true
env:
# Color output for pytest is off by default.
PYTEST_ADDOPTS: "--color=yes"
FORCE_COLOR: "1"
jobs: jobs:
lint: lint:
name: "Lint" name: "Lint"
@@ -30,16 +35,17 @@ jobs:
- name: Set up Python - name: Set up Python
uses: actions/setup-python@v5 uses: actions/setup-python@v5
with: with:
python-version: "3.11" python-version: "3.12"
- name: Install ruff - name: Install ruff
run: | run: |
pip install ruff==0.5.4 pip install ruff==0.9.9
- name: Format check - name: Format check
run: ruff format --check . run: ruff format --check .
- name: Lint - name: Lint
run: ruff check . run: ruff check .
doctest:
name: "Doctest" type-check:
name: "Type Check"
timeout-minutes: 30 timeout-minutes: 30
runs-on: "ubuntu-22.04" runs-on: "ubuntu-22.04"
defaults: defaults:
@@ -54,7 +60,36 @@ jobs:
- name: Set up Python - name: Set up Python
uses: actions/setup-python@v5 uses: actions/setup-python@v5
with: with:
python-version: "3.11" python-version: "3.12"
- name: Install protobuf compiler
run: |
sudo apt update
sudo apt install -y protobuf-compiler
pip install toml
- name: Install dependencies
run: |
python ../ci/parse_requirements.py pyproject.toml --extras dev,tests,embeddings > requirements.txt
pip install -r requirements.txt
- name: Run pyright
run: pyright
doctest:
name: "Doctest"
timeout-minutes: 30
runs-on: "ubuntu-24.04"
defaults:
run:
shell: bash
working-directory: python
steps:
- uses: actions/checkout@v4
with:
fetch-depth: 0
lfs: true
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: "3.12"
cache: "pip" cache: "pip"
- name: Install protobuf - name: Install protobuf
run: | run: |
@@ -75,8 +110,8 @@ jobs:
timeout-minutes: 30 timeout-minutes: 30
strategy: strategy:
matrix: matrix:
python-minor-version: ["9", "11"] python-minor-version: ["9", "12"]
runs-on: "ubuntu-22.04" runs-on: "ubuntu-24.04"
defaults: defaults:
run: run:
shell: bash shell: bash
@@ -101,6 +136,10 @@ jobs:
- uses: ./.github/workflows/run_tests - uses: ./.github/workflows/run_tests
with: with:
integration: true integration: true
- name: Test without pylance or pandas
run: |
pip uninstall -y pylance pandas
pytest -vv python/tests/test_table.py
# Make sure wheels are not included in the Rust cache # Make sure wheels are not included in the Rust cache
- name: Delete wheels - name: Delete wheels
run: rm -rf target/wheels run: rm -rf target/wheels
@@ -127,7 +166,7 @@ jobs:
- name: Set up Python - name: Set up Python
uses: actions/setup-python@v5 uses: actions/setup-python@v5
with: with:
python-version: "3.11" python-version: "3.12"
- uses: Swatinem/rust-cache@v2 - uses: Swatinem/rust-cache@v2
with: with:
workspaces: python workspaces: python
@@ -157,7 +196,7 @@ jobs:
- name: Set up Python - name: Set up Python
uses: actions/setup-python@v5 uses: actions/setup-python@v5
with: with:
python-version: "3.11" python-version: "3.12"
- uses: Swatinem/rust-cache@v2 - uses: Swatinem/rust-cache@v2
with: with:
workspaces: python workspaces: python
@@ -168,7 +207,7 @@ jobs:
run: rm -rf target/wheels run: rm -rf target/wheels
pydantic1x: pydantic1x:
timeout-minutes: 30 timeout-minutes: 30
runs-on: "ubuntu-22.04" runs-on: "ubuntu-24.04"
defaults: defaults:
run: run:
shell: bash shell: bash
@@ -189,6 +228,7 @@ jobs:
- name: Install lancedb - name: Install lancedb
run: | run: |
pip install "pydantic<2" pip install "pydantic<2"
pip install pyarrow==16
pip install --extra-index-url https://pypi.fury.io/lancedb/ -e .[tests] pip install --extra-index-url https://pypi.fury.io/lancedb/ -e .[tests]
pip install tantivy pip install tantivy
- name: Run tests - name: Run tests


@@ -22,6 +22,7 @@ env:
# "1" means line tables only, which is useful for panic tracebacks. # "1" means line tables only, which is useful for panic tracebacks.
RUSTFLAGS: "-C debuginfo=1" RUSTFLAGS: "-C debuginfo=1"
RUST_BACKTRACE: "1" RUST_BACKTRACE: "1"
CARGO_INCREMENTAL: 0
jobs: jobs:
lint: lint:
@@ -51,6 +52,33 @@ jobs:
- name: Run clippy - name: Run clippy
run: cargo clippy --workspace --tests --all-features -- -D warnings run: cargo clippy --workspace --tests --all-features -- -D warnings
build-no-lock:
runs-on: ubuntu-24.04
timeout-minutes: 30
env:
# Need up-to-date compilers for kernels
CC: clang
CXX: clang++
steps:
- uses: actions/checkout@v4
# Building without a lock file often requires the latest Rust version since downstream
# dependencies may have updated their minimum Rust version.
- uses: actions-rust-lang/setup-rust-toolchain@v1
with:
toolchain: "stable"
# Remove cargo.lock to force a fresh build
- name: Remove Cargo.lock
run: rm -f Cargo.lock
- uses: rui314/setup-mold@v1
- uses: Swatinem/rust-cache@v2
- name: Install dependencies
run: |
sudo apt update
sudo apt install -y protobuf-compiler libssl-dev
- name: Build all
run: |
cargo build --benches --all-features --tests
linux: linux:
timeout-minutes: 30 timeout-minutes: 30
# To build all features, we need more disk space than is available # To build all features, we need more disk space than is available
@@ -75,8 +103,11 @@ jobs:
workspaces: rust workspaces: rust
- name: Install dependencies - name: Install dependencies
run: | run: |
sudo apt update # This shaves 2 minutes off this step in CI. This doesn't seem to be
# necessary in standard runners, but it is in the 4x runners.
sudo rm /var/lib/man-db/auto-update
sudo apt install -y protobuf-compiler libssl-dev sudo apt install -y protobuf-compiler libssl-dev
- uses: rui314/setup-mold@v1
- name: Make Swap - name: Make Swap
run: | run: |
sudo fallocate -l 16G /swapfile sudo fallocate -l 16G /swapfile
@@ -87,11 +118,11 @@ jobs:
working-directory: . working-directory: .
run: docker compose up --detach --wait run: docker compose up --detach --wait
- name: Build - name: Build
run: cargo build --all-features run: cargo build --all-features --tests --locked --examples
- name: Run tests - name: Run tests
run: cargo test --all-features run: cargo test --all-features --locked
- name: Run examples - name: Run examples
run: cargo run --example simple run: cargo run --example simple --locked
macos: macos:
timeout-minutes: 30 timeout-minutes: 30
@@ -115,129 +146,43 @@ jobs:
workspaces: rust workspaces: rust
- name: Install dependencies - name: Install dependencies
run: brew install protobuf run: brew install protobuf
- name: Build
run: cargo build --all-features
- name: Run tests - name: Run tests
# Run with everything except the integration tests. run: |
run: cargo test --features remote,fp16kernels # Don't run the s3 integration tests since docker isn't available
# on this image.
ALL_FEATURES=`cargo metadata --format-version=1 --no-deps \
| jq -r '.packages[] | .features | keys | .[]' \
| grep -v s3-test | sort | uniq | paste -s -d "," -`
cargo test --features $ALL_FEATURES --locked
windows: windows:
runs-on: windows-2022 runs-on: windows-2022
strategy:
matrix:
target:
- x86_64-pc-windows-msvc
- aarch64-pc-windows-msvc
defaults:
run:
working-directory: rust/lancedb
steps: steps:
- uses: actions/checkout@v4 - uses: actions/checkout@v4
- uses: Swatinem/rust-cache@v2 - uses: Swatinem/rust-cache@v2
with: with:
workspaces: rust workspaces: rust
- name: Install Protoc v21.12 - name: Install Protoc v21.12
working-directory: C:\ run: choco install --no-progress protoc
- name: Build
run: | run: |
New-Item -Path 'C:\protoc' -ItemType Directory rustup target add ${{ matrix.target }}
Set-Location C:\protoc $env:VCPKG_ROOT = $env:VCPKG_INSTALLATION_ROOT
Invoke-WebRequest https://github.com/protocolbuffers/protobuf/releases/download/v21.12/protoc-21.12-win64.zip -OutFile C:\protoc\protoc.zip cargo build --features remote --tests --locked --target ${{ matrix.target }}
7z x protoc.zip
Add-Content $env:GITHUB_PATH "C:\protoc\bin"
shell: powershell
- name: Run tests - name: Run tests
# Can only run tests when target matches host
if: ${{ matrix.target == 'x86_64-pc-windows-msvc' }}
run: | run: |
$env:VCPKG_ROOT = $env:VCPKG_INSTALLATION_ROOT $env:VCPKG_ROOT = $env:VCPKG_INSTALLATION_ROOT
cargo build cargo test --features remote --locked
cargo test
windows-arm64:
runs-on: windows-4x-arm
steps:
- name: Install Git
run: |
Invoke-WebRequest -Uri "https://github.com/git-for-windows/git/releases/download/v2.44.0.windows.1/Git-2.44.0-64-bit.exe" -OutFile "git-installer.exe"
Start-Process -FilePath "git-installer.exe" -ArgumentList "/VERYSILENT", "/NORESTART" -Wait
shell: powershell
- name: Add Git to PATH
run: |
Add-Content $env:GITHUB_PATH "C:\Program Files\Git\bin"
$env:Path = [System.Environment]::GetEnvironmentVariable("Path","Machine") + ";" + [System.Environment]::GetEnvironmentVariable("Path","User")
shell: powershell
- name: Configure Git symlinks
run: git config --global core.symlinks true
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
with:
python-version: "3.13"
- name: Install Visual Studio Build Tools
run: |
Invoke-WebRequest -Uri "https://aka.ms/vs/17/release/vs_buildtools.exe" -OutFile "vs_buildtools.exe"
Start-Process -FilePath "vs_buildtools.exe" -ArgumentList "--quiet", "--wait", "--norestart", "--nocache", `
"--installPath", "C:\BuildTools", `
"--add", "Microsoft.VisualStudio.Component.VC.Tools.ARM64", `
"--add", "Microsoft.VisualStudio.Component.VC.Tools.x86.x64", `
"--add", "Microsoft.VisualStudio.Component.Windows11SDK.22621", `
"--add", "Microsoft.VisualStudio.Component.VC.ATL", `
"--add", "Microsoft.VisualStudio.Component.VC.ATLMFC", `
"--add", "Microsoft.VisualStudio.Component.VC.Llvm.Clang" -Wait
shell: powershell
- name: Add Visual Studio Build Tools to PATH
run: |
$vsPath = "C:\BuildTools\VC\Tools\MSVC"
$latestVersion = (Get-ChildItem $vsPath | Sort-Object {[version]$_.Name} -Descending)[0].Name
Add-Content $env:GITHUB_PATH "C:\BuildTools\VC\Tools\MSVC\$latestVersion\bin\Hostx64\arm64"
Add-Content $env:GITHUB_PATH "C:\BuildTools\VC\Tools\MSVC\$latestVersion\bin\Hostx64\x64"
Add-Content $env:GITHUB_PATH "C:\Program Files (x86)\Windows Kits\10\bin\10.0.22621.0\arm64"
Add-Content $env:GITHUB_PATH "C:\Program Files (x86)\Windows Kits\10\bin\10.0.22621.0\x64"
Add-Content $env:GITHUB_PATH "C:\BuildTools\VC\Tools\Llvm\x64\bin"
# Add MSVC runtime libraries to LIB
$env:LIB = "C:\BuildTools\VC\Tools\MSVC\$latestVersion\lib\arm64;" +
"C:\Program Files (x86)\Windows Kits\10\Lib\10.0.22621.0\um\arm64;" +
"C:\Program Files (x86)\Windows Kits\10\Lib\10.0.22621.0\ucrt\arm64"
Add-Content $env:GITHUB_ENV "LIB=$env:LIB"
# Add INCLUDE paths
$env:INCLUDE = "C:\BuildTools\VC\Tools\MSVC\$latestVersion\include;" +
"C:\Program Files (x86)\Windows Kits\10\Include\10.0.22621.0\ucrt;" +
"C:\Program Files (x86)\Windows Kits\10\Include\10.0.22621.0\um;" +
"C:\Program Files (x86)\Windows Kits\10\Include\10.0.22621.0\shared"
Add-Content $env:GITHUB_ENV "INCLUDE=$env:INCLUDE"
shell: powershell
- name: Install Rust
run: |
Invoke-WebRequest https://win.rustup.rs/x86_64 -OutFile rustup-init.exe
.\rustup-init.exe -y --default-host aarch64-pc-windows-msvc
shell: powershell
- name: Add Rust to PATH
run: |
Add-Content $env:GITHUB_PATH "$env:USERPROFILE\.cargo\bin"
shell: powershell
- uses: Swatinem/rust-cache@v2
with:
workspaces: rust
- name: Install 7-Zip ARM
run: |
New-Item -Path 'C:\7zip' -ItemType Directory
Invoke-WebRequest https://7-zip.org/a/7z2408-arm64.exe -OutFile C:\7zip\7z-installer.exe
Start-Process -FilePath C:\7zip\7z-installer.exe -ArgumentList '/S' -Wait
shell: powershell
- name: Add 7-Zip to PATH
run: Add-Content $env:GITHUB_PATH "C:\Program Files\7-Zip"
shell: powershell
- name: Install Protoc v21.12
working-directory: C:\
run: |
if (Test-Path 'C:\protoc') {
Write-Host "Protoc directory exists, skipping installation"
return
}
New-Item -Path 'C:\protoc' -ItemType Directory
Set-Location C:\protoc
Invoke-WebRequest https://github.com/protocolbuffers/protobuf/releases/download/v21.12/protoc-21.12-win64.zip -OutFile C:\protoc\protoc.zip
& 'C:\Program Files\7-Zip\7z.exe' x protoc.zip
shell: powershell
- name: Add Protoc to PATH
run: Add-Content $env:GITHUB_PATH "C:\protoc\bin"
shell: powershell
- name: Run tests
run: |
$env:VCPKG_ROOT = $env:VCPKG_INSTALLATION_ROOT
cargo build --target aarch64-pc-windows-msvc
cargo test --target aarch64-pc-windows-msvc
msrv: msrv:
# Check the minimum supported Rust version # Check the minimum supported Rust version

3
.gitignore vendored

@@ -9,7 +9,6 @@ venv
.vscode .vscode
.zed .zed
rust/target rust/target
rust/Cargo.lock
site site
@@ -42,5 +41,3 @@ dist
target target
**/sccache.log **/sccache.log
Cargo.lock


@@ -7,9 +7,15 @@ repos:
- id: trailing-whitespace - id: trailing-whitespace
- repo: https://github.com/astral-sh/ruff-pre-commit - repo: https://github.com/astral-sh/ruff-pre-commit
# Ruff version. # Ruff version.
rev: v0.2.2 rev: v0.9.9
hooks: hooks:
- id: ruff - id: ruff
# - repo: https://github.com/RobertCraigie/pyright-python
# rev: v1.1.395
# hooks:
# - id: pyright
# args: ["--project", "python"]
# additional_dependencies: [pyarrow-stubs]
- repo: local - repo: local
hooks: hooks:
- id: local-biome-check - id: local-biome-check

78
CONTRIBUTING.md Normal file

@@ -0,0 +1,78 @@
# Contributing to LanceDB
LanceDB is an open-source project and we welcome contributions from the community.
This document outlines the process for contributing to LanceDB.
## Reporting Issues
If you encounter a bug or have a feature request, please open an issue on the
[GitHub issue tracker](https://github.com/lancedb/lancedb).
## Picking an issue
We track issues on the GitHub issue tracker. If you are looking for something to
work on, check the [good first issue](https://github.com/lancedb/lancedb/contribute) label. These issues are typically the best described and have the smallest scope.
If there's an issue you are interested in working on, please leave a comment on the issue. This will help us avoid duplicate work. Additionally, if you have questions about the issue, please ask them in the issue comments. We are happy to provide guidance on how to approach the issue.
## Configuring Git
First, fork the repository on GitHub, then clone your fork:
```bash
git clone https://github.com/<username>/lancedb.git
cd lancedb
```
Then add the main repository as a remote:
```bash
git remote add upstream https://github.com/lancedb/lancedb.git
git fetch upstream
```
## Setting up your development environment
We have development environments for Python, Typescript, and Java. Each environment has its own setup instructions.
* [Python](python/CONTRIBUTING.md)
* [Typescript](nodejs/CONTRIBUTING.md)
<!-- TODO: add Java contributing guide -->
* [Documentation](docs/README.md)
## Best practices for pull requests
For the best chance of having your pull request accepted, please follow these guidelines:
1. Unit test all bug fixes and new features. Your code will not be merged if it
doesn't have tests.
1. If you change the public API, update the documentation in the `docs` directory.
1. Aim to minimize the number of changes in each pull request. Keep to solving
one problem at a time, when possible.
1. Before marking a pull request ready-for-review, do a self review of your code.
Is it clear why you are making the changes? Are the changes easy to understand?
1. Use [conventional commit messages](https://www.conventionalcommits.org/en/) as pull request titles. Examples:
* New feature: `feat: adding foo API`
* Bug fix: `fix: issue with foo API`
* Documentation change: `docs: adding foo API documentation`
1. If your pull request is a work in progress, leave the pull request as a draft.
We will assume the pull request is ready for review when it is opened.
1. When writing tests, test the error cases. Make sure they have understandable
error messages.
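
To make the title convention from the list above concrete, here is a small Python sketch that checks whether a pull request title follows the conventional-commit shape. It is an illustration only and not part of the repository's tooling: the helper name `is_conventional_title` is hypothetical, and the list of types beyond `feat`, `fix`, and `docs` is an assumption drawn from common Conventional Commits usage.

```python
import re

# Assumed set of types; this guide itself only shows feat, fix and docs.
ALLOWED_TYPES = ("feat", "fix", "docs", "chore", "ci", "test", "refactor", "perf")

# type, optional (scope), optional "!" for breaking changes, then ": summary"
TITLE_RE = re.compile(
    r"^(?P<type>" + "|".join(ALLOWED_TYPES) + r")"
    r"(\((?P<scope>[^)]+)\))?"
    r"(?P<breaking>!)?"
    r": (?P<summary>\S.*)$"
)


def is_conventional_title(title):
    """Return True if the title looks like 'type(scope): summary'."""
    return TITLE_RE.match(title) is not None


if __name__ == "__main__":
    for title in ("feat: adding foo API", "Adding foo API"):
        print(title, "->", is_conventional_title(title))
```

A check like this is strict about the colon and the single space after it, which is usually where hand-written titles go wrong.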
## Project structure
The core library is written in Rust. The Python, Typescript, and Java libraries
are wrappers around the Rust library.
* `rust/lancedb`: Rust library source code
* `python`: Python package source code
* `nodejs`: Typescript package source code
* `node`: **Deprecated** Typescript package source code
* `java`: Java package source code
* `docs`: Documentation source code
## Release process
For information on the release process, see: [release_process.md](release_process.md)

8603
Cargo.lock generated Normal file

File diff suppressed because it is too large


@@ -21,41 +21,55 @@ categories = ["database-implementations"]
rust-version = "1.78.0" rust-version = "1.78.0"
[workspace.dependencies] [workspace.dependencies]
lance = { "version" = "=0.21.0", "features" = [ lance = { "version" = "=0.27.0", "features" = ["dynamodb"], tag = "v0.27.0-beta.2", git="https://github.com/lancedb/lance.git" }
"dynamodb", lance-io = { version = "=0.27.0", tag = "v0.27.0-beta.2", git="https://github.com/lancedb/lance.git" }
], git = "https://github.com/lancedb/lance.git", tag = "v0.21.0-beta.5" } lance-index = { version = "=0.27.0", tag = "v0.27.0-beta.2", git="https://github.com/lancedb/lance.git" }
lance-io = { version = "=0.21.0", git = "https://github.com/lancedb/lance.git", tag = "v0.21.0-beta.5" } lance-linalg = { version = "=0.27.0", tag = "v0.27.0-beta.2", git="https://github.com/lancedb/lance.git" }
lance-index = { version = "=0.21.0", git = "https://github.com/lancedb/lance.git", tag = "v0.21.0-beta.5" } lance-table = { version = "=0.27.0", tag = "v0.27.0-beta.2", git="https://github.com/lancedb/lance.git" }
lance-linalg = { version = "=0.21.0", git = "https://github.com/lancedb/lance.git", tag = "v0.21.0-beta.5" } lance-testing = { version = "=0.27.0", tag = "v0.27.0-beta.2", git="https://github.com/lancedb/lance.git" }
lance-table = { version = "=0.21.0", git = "https://github.com/lancedb/lance.git", tag = "v0.21.0-beta.5" } lance-datafusion = { version = "=0.27.0", tag = "v0.27.0-beta.2", git="https://github.com/lancedb/lance.git" }
lance-testing = { version = "=0.21.0", git = "https://github.com/lancedb/lance.git", tag = "v0.21.0-beta.5" } lance-encoding = { version = "=0.27.0", tag = "v0.27.0-beta.2", git="https://github.com/lancedb/lance.git" }
lance-datafusion = { version = "=0.21.0", git = "https://github.com/lancedb/lance.git", tag = "v0.21.0-beta.5" }
lance-encoding = { version = "=0.21.0", git = "https://github.com/lancedb/lance.git", tag = "v0.21.0-beta.5" }
# Note that this one does not include pyarrow # Note that this one does not include pyarrow
arrow = { version = "53.2", optional = false } arrow = { version = "54.1", optional = false }
arrow-array = "53.2" arrow-array = "54.1"
arrow-data = "53.2" arrow-data = "54.1"
arrow-ipc = "53.2" arrow-ipc = "54.1"
arrow-ord = "53.2" arrow-ord = "54.1"
arrow-schema = "53.2" arrow-schema = "54.1"
arrow-arith = "53.2" arrow-arith = "54.1"
arrow-cast = "53.2" arrow-cast = "54.1"
async-trait = "0" async-trait = "0"
chrono = "0.4.35" datafusion = { version = "46.0", default-features = false }
datafusion-common = "42.0" datafusion-catalog = "46.0"
datafusion-physical-plan = "42.0" datafusion-common = { version = "46.0", default-features = false }
env_logger = "0.10" datafusion-execution = "46.0"
datafusion-expr = "46.0"
datafusion-physical-plan = "46.0"
env_logger = "0.11"
half = { "version" = "=2.4.1", default-features = false, features = [ half = { "version" = "=2.4.1", default-features = false, features = [
"num-traits", "num-traits",
] } ] }
futures = "0" futures = "0"
log = "0.4" log = "0.4"
moka = { version = "0.11", features = ["future"] } moka = { version = "0.12", features = ["future"] }
object_store = "0.10.2" object_store = "0.11.0"
pin-project = "1.0.7" pin-project = "1.0.7"
snafu = "0.7.4" snafu = "0.8"
url = "2" url = "2"
num-traits = "0.2" num-traits = "0.2"
rand = "0.8" rand = "0.8"
regex = "1.10" regex = "1.10"
lazy_static = "1" lazy_static = "1"
semver = "1.0.25"
# Temporary pins to work around downstream issues
# https://github.com/apache/arrow-rs/commit/2fddf85afcd20110ce783ed5b4cdeb82293da30b
chrono = "=0.4.39"
# https://github.com/RustCrypto/formats/issues/1684
base64ct = "=1.6.0"
# Workaround for: https://github.com/eira-fransham/crunchy/issues/13
crunchy = "=0.2.2"
# Workaround for: https://github.com/Lokathor/bytemuck/issues/306
bytemuck_derive = ">=1.8.1, <1.9.0"


@@ -1,9 +1,17 @@
<a href="https://cloud.lancedb.com" target="_blank">
<img src="https://github.com/user-attachments/assets/92dad0a2-2a37-4ce1-b783-0d1b4f30a00c" alt="LanceDB Cloud Public Beta" width="100%" style="max-width: 100%;">
</a>
<div align="center"> <div align="center">
<p align="center"> <p align="center">
<img width="275" alt="LanceDB Logo" src="https://github.com/lancedb/lancedb/assets/5846846/37d7c7ad-c2fd-4f56-9f16-fffb0d17c73a"> <picture>
<source media="(prefers-color-scheme: dark)" srcset="https://github.com/user-attachments/assets/ac270358-333e-4bea-a132-acefaa94040e">
<source media="(prefers-color-scheme: light)" srcset="https://github.com/user-attachments/assets/b864d814-0d29-4784-8fd9-807297c758c0">
<img alt="LanceDB Logo" src="https://github.com/user-attachments/assets/b864d814-0d29-4784-8fd9-807297c758c0" width=300>
</picture>
**Developer-friendly, database for multimodal AI** **Search More, Manage Less**
<a href='https://github.com/lancedb/vectordb-recipes/tree/main' target="_blank"><img alt='LanceDB' src='https://img.shields.io/badge/VectorDB_Recipes-100000?style=for-the-badge&logo=LanceDB&logoColor=white&labelColor=645cfb&color=645cfb'/></a> <a href='https://github.com/lancedb/vectordb-recipes/tree/main' target="_blank"><img alt='LanceDB' src='https://img.shields.io/badge/VectorDB_Recipes-100000?style=for-the-badge&logo=LanceDB&logoColor=white&labelColor=645cfb&color=645cfb'/></a>
<a href='https://lancedb.github.io/lancedb/' target="_blank"><img alt='lancdb' src='https://img.shields.io/badge/DOCS-100000?style=for-the-badge&logo=lancdb&logoColor=white&labelColor=645cfb&color=645cfb'/></a> <a href='https://lancedb.github.io/lancedb/' target="_blank"><img alt='lancdb' src='https://img.shields.io/badge/DOCS-100000?style=for-the-badge&logo=lancdb&logoColor=white&labelColor=645cfb&color=645cfb'/></a>


@@ -1,21 +0,0 @@
#!/bin/bash
set -e
ARCH=${1:-x86_64}
# We pass down the current user so that when we later mount the local files
# into the container, the files are accessible by the current user.
pushd ci/manylinux_node
docker build \
-t lancedb-node-manylinux-$ARCH \
--build-arg="ARCH=$ARCH" \
--build-arg="DOCKER_USER=$(id -u)" \
--progress=plain \
.
popd
# We turn on memory swap to avoid OOM killer
docker run \
-v $(pwd):/io -w /io \
--memory-swap=-1 \
lancedb-node-manylinux-$ARCH \
bash ci/manylinux_node/build_lancedb.sh $ARCH


@@ -1,34 +0,0 @@
# Builds the macOS artifacts (nodejs binaries).
# Usage: ./ci/build_macos_artifacts_nodejs.sh [target]
# Targets supported: x86_64-apple-darwin aarch64-apple-darwin
set -e
prebuild_rust() {
# Building here for the sake of easier debugging.
pushd rust/lancedb
echo "Building rust library for $1"
export RUST_BACKTRACE=1
cargo build --release --target $1
popd
}
build_node_binaries() {
pushd nodejs
echo "Building nodejs library for $1"
export RUST_TARGET=$1
npm run build-release
popd
}
if [ -n "$1" ]; then
targets=$1
else
targets="x86_64-apple-darwin aarch64-apple-darwin"
fi
echo "Building artifacts for targets: $targets"
for target in $targets
do
prebuild_rust $target
build_node_binaries $target
done


@@ -9,10 +9,6 @@ FROM quay.io/pypa/manylinux_2_28_${ARCH}
ARG ARCH=x86_64 ARG ARCH=x86_64
ARG DOCKER_USER=default_user ARG DOCKER_USER=default_user
# Install static openssl
COPY install_openssl.sh install_openssl.sh
RUN ./install_openssl.sh ${ARCH} > /dev/null
# Protobuf is also installed as root. # Protobuf is also installed as root.
COPY install_protobuf.sh install_protobuf.sh COPY install_protobuf.sh install_protobuf.sh
RUN ./install_protobuf.sh ${ARCH} RUN ./install_protobuf.sh ${ARCH}


@@ -1,19 +0,0 @@
#!/bin/bash
# Builds the nodejs module for manylinux. Invoked by ci/build_linux_artifacts_nodejs.sh.
set -e
ARCH=${1:-x86_64}
if [ "$ARCH" = "x86_64" ]; then
export OPENSSL_LIB_DIR=/usr/local/lib64/
else
export OPENSSL_LIB_DIR=/usr/local/lib/
fi
export OPENSSL_STATIC=1
export OPENSSL_INCLUDE_DIR=/usr/local/include/openssl
#Alpine doesn't have .bashrc
FILE=$HOME/.bashrc && test -f $FILE && source $FILE
cd nodejs
npm ci
npm run build-release


@@ -4,14 +4,6 @@ set -e
ARCH=${1:-x86_64} ARCH=${1:-x86_64}
TARGET_TRIPLE=${2:-x86_64-unknown-linux-gnu} TARGET_TRIPLE=${2:-x86_64-unknown-linux-gnu}
if [ "$ARCH" = "x86_64" ]; then
export OPENSSL_LIB_DIR=/usr/local/lib64/
else
export OPENSSL_LIB_DIR=/usr/local/lib/
fi
export OPENSSL_STATIC=1
export OPENSSL_INCLUDE_DIR=/usr/local/include/openssl
#Alpine doesn't have .bashrc #Alpine doesn't have .bashrc
FILE=$HOME/.bashrc && test -f $FILE && source $FILE FILE=$HOME/.bashrc && test -f $FILE && source $FILE


@@ -1,26 +0,0 @@
#!/bin/bash
# Builds openssl from source so we can statically link to it
# this is to avoid the error we get with the system installation:
# /usr/bin/ld: <library>: version node not found for symbol SSLeay@@OPENSSL_1.0.1
# /usr/bin/ld: failed to set dynamic section sizes: Bad value
set -e
git clone -b OpenSSL_1_1_1v \
--single-branch \
https://github.com/openssl/openssl.git
pushd openssl
if [[ $1 == x86_64* ]]; then
ARCH=linux-x86_64
else
# gnu target
ARCH=linux-aarch64
fi
./Configure no-shared $ARCH
make
make install

41
ci/parse_requirements.py Normal file

@@ -0,0 +1,41 @@
import argparse
import toml


def parse_dependencies(pyproject_path, extras=None):
    with open(pyproject_path, "r") as file:
        pyproject = toml.load(file)

    dependencies = pyproject.get("project", {}).get("dependencies", [])
    for dependency in dependencies:
        print(dependency)

    optional_dependencies = pyproject.get("project", {}).get(
        "optional-dependencies", {}
    )

    if extras:
        for extra in extras.split(","):
            for dep in optional_dependencies.get(extra, []):
                print(dep)


def main():
    parser = argparse.ArgumentParser(
        description="Generate requirements.txt from pyproject.toml"
    )
    parser.add_argument("path", type=str, help="Path to pyproject.toml")
    parser.add_argument(
        "--extras",
        type=str,
        help="Comma-separated list of extras to include",
        default="",
    )

    args = parser.parse_args()

    parse_dependencies(args.path, args.extras)


if __name__ == "__main__":
    main()


@@ -53,7 +53,7 @@ curl -O https://download.visualstudio.microsoft.com/download/pr/32863b8d-a46d-42
curl -O https://download.visualstudio.microsoft.com/download/pr/32863b8d-a46d-4231-8e84-0888519d20a9/149578fb3b621cdb61ee1813b9b3e791/463ad1b0783ebda908fd6c16a4abfe93.cab curl -O https://download.visualstudio.microsoft.com/download/pr/32863b8d-a46d-4231-8e84-0888519d20a9/149578fb3b621cdb61ee1813b9b3e791/463ad1b0783ebda908fd6c16a4abfe93.cab
curl -O https://download.visualstudio.microsoft.com/download/pr/32863b8d-a46d-4231-8e84-0888519d20a9/5c986c4f393c6b09d5aec3b539e9fb4a/5a22e5cde814b041749fb271547f4dd5.cab curl -O https://download.visualstudio.microsoft.com/download/pr/32863b8d-a46d-4231-8e84-0888519d20a9/5c986c4f393c6b09d5aec3b539e9fb4a/5a22e5cde814b041749fb271547f4dd5.cab
# fwpuclnt.lib arm64rt.lib # dbghelp.lib fwpuclnt.lib arm64rt.lib
curl -O https://download.visualstudio.microsoft.com/download/pr/32863b8d-a46d-4231-8e84-0888519d20a9/7a332420d812f7c1d41da865ae5a7c52/windows%20sdk%20desktop%20libs%20arm64-x86_en-us.msi curl -O https://download.visualstudio.microsoft.com/download/pr/32863b8d-a46d-4231-8e84-0888519d20a9/7a332420d812f7c1d41da865ae5a7c52/windows%20sdk%20desktop%20libs%20arm64-x86_en-us.msi
curl -O https://download.visualstudio.microsoft.com/download/pr/32863b8d-a46d-4231-8e84-0888519d20a9/19de98ed4a79938d0045d19c047936b3/3e2f7be479e3679d700ce0782e4cc318.cab curl -O https://download.visualstudio.microsoft.com/download/pr/32863b8d-a46d-4231-8e84-0888519d20a9/19de98ed4a79938d0045d19c047936b3/3e2f7be479e3679d700ce0782e4cc318.cab
@@ -98,7 +98,7 @@ find /usr/aarch64-pc-windows-msvc/usr/include -type f -exec sed -i -E 's/(#inclu
# reason: https://developercommunity.visualstudio.com/t/libucrtlibstreamobj-error-lnk2001-unresolved-exter/1544787#T-ND1599818 # reason: https://developercommunity.visualstudio.com/t/libucrtlibstreamobj-error-lnk2001-unresolved-exter/1544787#T-ND1599818
# I don't understand the 'correct' fix for this, arm64rt.lib is supposed to be the workaround # I don't understand the 'correct' fix for this, arm64rt.lib is supposed to be the workaround
(cd 'program files/windows kits/10/lib/10.0.26100.0/um/arm64' && cp advapi32.lib bcrypt.lib kernel32.lib ntdll.lib user32.lib uuid.lib ws2_32.lib userenv.lib cfgmgr32.lib runtimeobject.lib fwpuclnt.lib arm64rt.lib -t /usr/aarch64-pc-windows-msvc/usr/lib) (cd 'program files/windows kits/10/lib/10.0.26100.0/um/arm64' && cp advapi32.lib bcrypt.lib kernel32.lib ntdll.lib user32.lib uuid.lib ws2_32.lib userenv.lib cfgmgr32.lib runtimeobject.lib dbghelp.lib fwpuclnt.lib arm64rt.lib -t /usr/aarch64-pc-windows-msvc/usr/lib)
(cd 'contents/vc/tools/msvc/14.16.27023/lib/arm64' && cp libcmt.lib libvcruntime.lib -t /usr/aarch64-pc-windows-msvc/usr/lib) (cd 'contents/vc/tools/msvc/14.16.27023/lib/arm64' && cp libcmt.lib libvcruntime.lib -t /usr/aarch64-pc-windows-msvc/usr/lib)


@@ -0,0 +1,34 @@
import tomllib

found_preview_lance = False

with open("Cargo.toml", "rb") as f:
    cargo_data = tomllib.load(f)

    for name, dep in cargo_data["workspace"]["dependencies"].items():
        if name == "lance" or name.startswith("lance-"):
            if isinstance(dep, str):
                version = dep
            elif isinstance(dep, dict):
                # Version doesn't have the beta tag in it, so we instead look
                # at the git tag.
                version = dep.get('tag', dep.get('version'))
            else:
                raise ValueError("Unexpected type for dependency: " + str(dep))

            if "beta" in version:
                found_preview_lance = True
                print(f"Dependency '{name}' is a preview version: {version}")

with open("python/pyproject.toml", "rb") as f:
    py_proj_data = tomllib.load(f)

    for dep in py_proj_data["project"]["dependencies"]:
        if dep.startswith("pylance"):
            if "b" in dep:
                found_preview_lance = True
                print(f"Dependency '{dep}' is a preview version")
            break  # Only one pylance dependency

if found_preview_lance:
    raise ValueError("Found preview version of Lance in dependencies")


@@ -2,43 +2,88 @@
LanceDB docs are deployed to https://lancedb.github.io/lancedb/. LanceDB docs are deployed to https://lancedb.github.io/lancedb/.
Docs is built and deployed automatically by [Github Actions](.github/workflows/docs.yml) Docs is built and deployed automatically by [Github Actions](../.github/workflows/docs.yml)
whenever a commit is pushed to the `main` branch. So it is possible for the docs to show whenever a commit is pushed to the `main` branch. So it is possible for the docs to show
unreleased features. unreleased features.
## Building the docs ## Building the docs
### Setup ### Setup
1. Install LanceDB. From LanceDB repo root: `pip install -e python` 1. Install LanceDB Python. See setup in [Python contributing guide](../python/CONTRIBUTING.md).
2. Install dependencies. From LanceDB repo root: `pip install -r docs/requirements.txt` Run `make develop` to install the Python package.
3. Make sure you have node and npm setup 2. Install documentation dependencies. From LanceDB repo root: `pip install -r docs/requirements.txt`
4. Make sure protobuf and libssl are installed
### Building node module and create markdown files ### Preview the docs
See [Javascript docs README](./src/javascript/README.md) ```shell
### Build docs
From LanceDB repo root:
Run: `PYTHONPATH=. mkdocs build -f docs/mkdocs.yml`
If successful, you should see a `docs/site` directory that you can verify locally.
### Run local server
You can run a local server to test the docs prior to deployment by navigating to the `docs` directory and running the following command:
```bash
cd docs cd docs
mkdocs serve mkdocs serve
``` ```
### Run doctest for typescript example If you want to just generate the HTML files:
```bash ```shell
cd lancedb/docs PYTHONPATH=. mkdocs build -f docs/mkdocs.yml
npm i ```
npm run build
npm run all If successful, you should see a `docs/site` directory that you can verify locally.
## Adding examples
To make sure examples are correct, we put examples in test files so they can be
run as part of our test suites.
You can see the tests are at:
* Python: `python/python/tests/docs`
* Typescript: `nodejs/examples/`
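
As a rough sketch of that pattern, a docs test file wraps each example in snippet section markers so that a Markdown page can include it by name with `--8<-- "path/to/file.py:section"`. The test name and section names below are hypothetical, and plain standard-library code stands in for LanceDB calls so that the sketch runs anywhere.

```python
# --8<-- [start:imports]
import math
# --8<-- [end:imports]


def test_l2_distance():
    # Only the lines between the markers are pulled into the rendered docs;
    # the assertion below stays in the test suite.
    # --8<-- [start:l2_distance]
    a = [0.0, 3.0]
    b = [4.0, 0.0]
    dist = math.dist(a, b)
    # --8<-- [end:l2_distance]
    assert dist == 5.0
```

Keeping the assertion outside the marked section lets the published example stay short while the test still verifies its result.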
### Checking python examples
```shell
cd python
pytest -vv python/tests/docs
```
### Checking typescript examples
The `@lancedb/lancedb` package must be built before running the tests:
```shell
pushd nodejs
npm ci
npm run build
popd
```
Then you can run the examples by going to the `nodejs/examples` directory and
running the tests like a normal npm package:
```shell
pushd nodejs/examples
npm ci
npm test
popd
```
## API documentation
### Python
The Python API documentation is organized based on the file `docs/src/python/python.md`.
We manually add entries there so we can control the organization of the reference page.
**However, this means any new types must be manually added to the file.** No additional
steps are needed to generate the API documentation.
### Typescript
The typescript API documentation is generated from the typescript source code using [typedoc](https://typedoc.org/).
When new APIs are added, you must manually re-run the typedoc command to update the API documentation.
The new files should be checked into the repository.
```shell
pushd nodejs
npm run docs
popd
``` ```


@@ -4,6 +4,9 @@ repo_url: https://github.com/lancedb/lancedb
edit_uri: https://github.com/lancedb/lancedb/tree/main/docs/src edit_uri: https://github.com/lancedb/lancedb/tree/main/docs/src
repo_name: lancedb/lancedb repo_name: lancedb/lancedb
docs_dir: src docs_dir: src
watch:
- src
- ../python/python
theme: theme:
name: "material" name: "material"
@@ -63,6 +66,7 @@ plugins:
- https://arrow.apache.org/docs/objects.inv - https://arrow.apache.org/docs/objects.inv
- https://pandas.pydata.org/docs/objects.inv - https://pandas.pydata.org/docs/objects.inv
- https://lancedb.github.io/lance/objects.inv - https://lancedb.github.io/lance/objects.inv
- https://docs.pydantic.dev/latest/objects.inv
- mkdocs-jupyter - mkdocs-jupyter
- render_swagger: - render_swagger:
allow_arbitrary_locations: true allow_arbitrary_locations: true
@@ -101,7 +105,8 @@ markdown_extensions:
nav: nav:
- Home: - Home:
- LanceDB: index.md - LanceDB: index.md
- 🏃🏼‍♂️ Quick start: basic.md - 👉 Quickstart: quickstart.md
- 🏃🏼‍♂️ Basic Usage: basic.md
- 📚 Concepts: - 📚 Concepts:
- Vector search: concepts/vector_search.md - Vector search: concepts/vector_search.md
- Indexing: - Indexing:
@@ -120,6 +125,9 @@ nav:
- Overview: hybrid_search/hybrid_search.md - Overview: hybrid_search/hybrid_search.md
- Comparing Rerankers: hybrid_search/eval.md - Comparing Rerankers: hybrid_search/eval.md
- Airbnb financial data example: notebooks/hybrid_search.ipynb - Airbnb financial data example: notebooks/hybrid_search.ipynb
- Late interaction with MultiVector search:
- Overview: guides/multi-vector.md
- Example: notebooks/Multivector_on_LanceDB.ipynb
- RAG: - RAG:
- Vanilla RAG: rag/vanilla_rag.md - Vanilla RAG: rag/vanilla_rag.md
- Multi-head RAG: rag/multi_head_rag.md - Multi-head RAG: rag/multi_head_rag.md
@@ -146,7 +154,9 @@ nav:
- Building Custom Rerankers: reranking/custom_reranker.md - Building Custom Rerankers: reranking/custom_reranker.md
- Example: notebooks/lancedb_reranking.ipynb - Example: notebooks/lancedb_reranking.ipynb
- Filtering: sql.md - Filtering: sql.md
- Versioning & Reproducibility: notebooks/reproducibility.ipynb - Versioning & Reproducibility:
- sync API: notebooks/reproducibility.ipynb
- async API: notebooks/reproducibility_async.ipynb
- Configuring Storage: guides/storage.md - Configuring Storage: guides/storage.md
- Migration Guide: migration.md - Migration Guide: migration.md
- Tuning retrieval performance: - Tuning retrieval performance:
@@ -176,6 +186,7 @@ nav:
- Imagebind embeddings: embeddings/available_embedding_models/multimodal_embedding_functions/imagebind_embedding.md - Imagebind embeddings: embeddings/available_embedding_models/multimodal_embedding_functions/imagebind_embedding.md
- Jina Embeddings: embeddings/available_embedding_models/multimodal_embedding_functions/jina_multimodal_embedding.md - Jina Embeddings: embeddings/available_embedding_models/multimodal_embedding_functions/jina_multimodal_embedding.md
- User-defined embedding functions: embeddings/custom_embedding_function.md - User-defined embedding functions: embeddings/custom_embedding_function.md
- Variables and secrets: embeddings/variables_and_secrets.md
- "Example: Multi-lingual semantic search": notebooks/multi_lingual_example.ipynb - "Example: Multi-lingual semantic search": notebooks/multi_lingual_example.ipynb
- "Example: MultiModal CLIP Embeddings": notebooks/DisappearingEmbeddingFunction.ipynb - "Example: MultiModal CLIP Embeddings": notebooks/DisappearingEmbeddingFunction.ipynb
- 🔌 Integrations: - 🔌 Integrations:
@@ -226,15 +237,10 @@ nav:
- 👾 JavaScript (vectordb): javascript/modules.md - 👾 JavaScript (vectordb): javascript/modules.md
- 👾 JavaScript (lancedb): js/globals.md - 👾 JavaScript (lancedb): js/globals.md
- 🦀 Rust: https://docs.rs/lancedb/latest/lancedb/ - 🦀 Rust: https://docs.rs/lancedb/latest/lancedb/
- ☁️ LanceDB Cloud:
- Overview: cloud/index.md
- API reference:
- 🐍 Python: python/saas-python.md
- 👾 JavaScript: javascript/modules.md
- REST API: cloud/rest.md
- FAQs: cloud/cloud_faq.md
- Quick start: basic.md - Getting Started:
- Quickstart: quickstart.md
- Basic Usage: basic.md
- Concepts: - Concepts:
- Vector search: concepts/vector_search.md - Vector search: concepts/vector_search.md
- Indexing: - Indexing:
@@ -253,6 +259,9 @@ nav:
- Overview: hybrid_search/hybrid_search.md - Overview: hybrid_search/hybrid_search.md
- Comparing Rerankers: hybrid_search/eval.md - Comparing Rerankers: hybrid_search/eval.md
- Airbnb financial data example: notebooks/hybrid_search.ipynb - Airbnb financial data example: notebooks/hybrid_search.ipynb
- Late interaction with MultiVector search:
- Overview: guides/multi-vector.md
- Document search Example: notebooks/Multivector_on_LanceDB.ipynb
- RAG: - RAG:
- Vanilla RAG: rag/vanilla_rag.md - Vanilla RAG: rag/vanilla_rag.md
- Multi-head RAG: rag/multi_head_rag.md - Multi-head RAG: rag/multi_head_rag.md
@@ -278,7 +287,9 @@ nav:
- Building Custom Rerankers: reranking/custom_reranker.md - Building Custom Rerankers: reranking/custom_reranker.md
- Example: notebooks/lancedb_reranking.ipynb - Example: notebooks/lancedb_reranking.ipynb
- Filtering: sql.md - Filtering: sql.md
- Versioning & Reproducibility: notebooks/reproducibility.ipynb - Versioning & Reproducibility:
- sync API: notebooks/reproducibility.ipynb
- async API: notebooks/reproducibility_async.ipynb
- Configuring Storage: guides/storage.md - Configuring Storage: guides/storage.md
- Migration Guide: migration.md - Migration Guide: migration.md
- Tuning retrieval performance: - Tuning retrieval performance:
@@ -307,6 +318,7 @@ nav:
- Imagebind embeddings: embeddings/available_embedding_models/multimodal_embedding_functions/imagebind_embedding.md - Imagebind embeddings: embeddings/available_embedding_models/multimodal_embedding_functions/imagebind_embedding.md
- Jina Embeddings: embeddings/available_embedding_models/multimodal_embedding_functions/jina_multimodal_embedding.md - Jina Embeddings: embeddings/available_embedding_models/multimodal_embedding_functions/jina_multimodal_embedding.md
- User-defined embedding functions: embeddings/custom_embedding_function.md - User-defined embedding functions: embeddings/custom_embedding_function.md
- Variables and secrets: embeddings/variables_and_secrets.md
- "Example: Multi-lingual semantic search": notebooks/multi_lingual_example.ipynb - "Example: Multi-lingual semantic search": notebooks/multi_lingual_example.ipynb
- "Example: MultiModal CLIP Embeddings": notebooks/DisappearingEmbeddingFunction.ipynb - "Example: MultiModal CLIP Embeddings": notebooks/DisappearingEmbeddingFunction.ipynb
- Integrations: - Integrations:
@@ -353,13 +365,6 @@ nav:
- Javascript (vectordb): javascript/modules.md - Javascript (vectordb): javascript/modules.md
- Javascript (lancedb): js/globals.md - Javascript (lancedb): js/globals.md
- Rust: https://docs.rs/lancedb/latest/lancedb/index.html - Rust: https://docs.rs/lancedb/latest/lancedb/index.html
- LanceDB Cloud:
- Overview: cloud/index.md
- API reference:
- 🐍 Python: python/saas-python.md
- 👾 JavaScript: javascript/modules.md
- REST API: cloud/rest.md
- FAQs: cloud/cloud_faq.md
extra_css: extra_css:
- styles/global.css - styles/global.css
@@ -367,6 +372,7 @@ extra_css:
extra_javascript: extra_javascript:
- "extra_js/init_ask_ai_widget.js" - "extra_js/init_ask_ai_widget.js"
- "extra_js/reo.js"
extra: extra:
analytics: analytics:


@@ -38,6 +38,13 @@ components:
required: true required: true
schema: schema:
type: string type: string
index_name:
name: index_name
in: path
description: name of the index
required: true
schema:
type: string
responses: responses:
invalid_request: invalid_request:
description: Invalid request description: Invalid request
@@ -164,7 +171,7 @@ paths:
distance_type: distance_type:
type: string type: string
description: | description: |
The distance metric to use for search. L2, Cosine, Dot and Hamming are supported. Default is L2. The distance metric to use for search. l2, Cosine, Dot and Hamming are supported. Default is l2.
bypass_vector_index: bypass_vector_index:
type: boolean type: boolean
description: | description: |
@@ -443,7 +450,7 @@ paths:
type: string type: string
nullable: false nullable: false
description: | description: |
The metric type to use for the index. L2, Cosine, Dot are supported. The metric type to use for the index. l2, Cosine, Dot are supported.
index_type: index_type:
type: string type: string
responses: responses:
@@ -485,3 +492,22 @@ paths:
$ref: "#/components/responses/unauthorized" $ref: "#/components/responses/unauthorized"
"404": "404":
$ref: "#/components/responses/not_found" $ref: "#/components/responses/not_found"
/v1/table/{name}/index/{index_name}/drop/:
post:
description: Drop an index from the table
tags:
- Tables
summary: Drop an index from the table
operationId: dropIndex
parameters:
- $ref: "#/components/parameters/table_name"
- $ref: "#/components/parameters/index_name"
responses:
"200":
description: Index successfully dropped
"400":
$ref: "#/components/responses/invalid_request"
"401":
$ref: "#/components/responses/unauthorized"
"404":
$ref: "#/components/responses/not_found"
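
A client call against the new endpoint might be assembled as follows. This is a sketch under stated assumptions: only the path template and the `POST` method come from the spec above, while the base URL, the `x-api-key` header name, and the helper name `build_drop_index_request` are placeholders chosen for illustration. The request is built but not sent.

```python
import urllib.request
from urllib.parse import quote


def build_drop_index_request(base_url, table_name, index_name, api_key):
    """Build (but do not send) the POST request that drops an index."""
    # Percent-encode both path parameters so unusual names stay in one segment.
    path = (
        f"/v1/table/{quote(table_name, safe='')}"
        f"/index/{quote(index_name, safe='')}/drop/"
    )
    return urllib.request.Request(
        url=base_url.rstrip("/") + path,
        method="POST",
        headers={"x-api-key": api_key},  # assumed auth header name
    )


if __name__ == "__main__":
    request = build_drop_index_request(
        "https://example.invalid", "my_table", "vector_idx", "placeholder-key"
    )
    print(request.get_method(), request.full_url)
```

Passing the returned object to `urllib.request.urlopen` would send it; per the spec, a `200` response means the index was dropped, and `400`, `401`, and `404` map to the shared error responses.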


@@ -18,24 +18,23 @@ See the [indexing](concepts/index_ivfpq.md) concepts guide for more information
Lance supports `IVF_PQ` index type by default. Lance supports `IVF_PQ` index type by default.
=== "Python" === "Python"
=== "Sync API"
Creating indexes is done via the [create_index](https://lancedb.github.io/lancedb/python/#lancedb.table.LanceTable.create_index) method. Creating indexes is done via the [create_index](https://lancedb.github.io/lancedb/python/#lancedb.table.LanceTable.create_index) method.
```python ```python
import lancedb --8<-- "python/python/tests/docs/test_guide_index.py:import-lancedb"
import numpy as np --8<-- "python/python/tests/docs/test_guide_index.py:import-numpy"
uri = "data/sample-lancedb" --8<-- "python/python/tests/docs/test_guide_index.py:create_ann_index"
db = lancedb.connect(uri) ```
=== "Async API"
Creating indexes is done via the [create_index](https://lancedb.github.io/lancedb/python/#lancedb.table.LanceTable.create_index) method.
# Create 10,000 sample vectors ```python
data = [{"vector": row, "item": f"item {i}"} --8<-- "python/python/tests/docs/test_guide_index.py:import-lancedb"
for i, row in enumerate(np.random.random((10_000, 1536)).astype('float32'))] --8<-- "python/python/tests/docs/test_guide_index.py:import-numpy"
--8<-- "python/python/tests/docs/test_guide_index.py:import-lancedb-ivfpq"
# Add the vectors to a table --8<-- "python/python/tests/docs/test_guide_index.py:create_ann_index_async"
tbl = db.create_table("my_vectors", data=data)
# Create and train the index - you need to have enough data in the table for an effective training step
tbl.create_index(num_partitions=256, num_sub_vectors=96)
``` ```
=== "TypeScript" === "TypeScript"
@@ -70,7 +69,7 @@ Lance supports `IVF_PQ` index type by default.
The following IVF_PQ parameters can be specified: The following IVF_PQ parameters can be specified:
- **distance_type**: The distance metric to use. By default it uses euclidean distance "`L2`". - **distance_type**: The distance metric to use. By default it uses euclidean distance "`l2`".
We also support "cosine" and "dot" distance as well. We also support "cosine" and "dot" distance as well.
- **num_partitions**: The number of partitions in the index. The default is the square root - **num_partitions**: The number of partitions in the index. The default is the square root
of the number of rows. of the number of rows.
@@ -127,6 +126,8 @@ You can specify the GPU device to train IVF partitions via
accelerator="mps" accelerator="mps"
) )
``` ```
!!! note
GPU based indexing is not yet supported with our asynchronous client.
Troubleshooting: Troubleshooting:
@@ -152,13 +153,15 @@ There are a couple of parameters that can be used to fine-tune the search:
=== "Python" === "Python"
=== "Sync API"
```python ```python
tbl.search(np.random.random((1536))) \ --8<-- "python/python/tests/docs/test_guide_index.py:vector_search"
.limit(2) \ ```
.nprobes(20) \ === "Async API"
.refine_factor(10) \
.to_pandas() ```python
--8<-- "python/python/tests/docs/test_guide_index.py:vector_search_async"
``` ```
```text ```text
@@ -196,9 +199,15 @@ The search will return the data requested in addition to the distance of each it
You can further filter the elements returned by a search using a where clause. You can further filter the elements returned by a search using a where clause.
=== "Python" === "Python"
=== "Sync API"
```python ```python
tbl.search(np.random.random((1536))).where("item != 'item 1141'").to_pandas() --8<-- "python/python/tests/docs/test_guide_index.py:vector_search_with_filter"
```
=== "Async API"
```python
--8<-- "python/python/tests/docs/test_guide_index.py:vector_search_async_with_filter"
``` ```
=== "TypeScript" === "TypeScript"
@@ -221,10 +230,16 @@ You can select the columns returned by the query using a select clause.
=== "Python" === "Python"
```python === "Sync API"
tbl.search(np.random.random((1536))).select(["vector"]).to_pandas()
```
```python
--8<-- "python/python/tests/docs/test_guide_index.py:vector_search_with_select"
```
=== "Async API"
```python
--8<-- "python/python/tests/docs/test_guide_index.py:vector_search_async_with_select"
```
```text ```text
vector _distance vector _distance


@@ -3,6 +3,7 @@ import * as vectordb from "vectordb";
// --8<-- [end:import] // --8<-- [end:import]
(async () => { (async () => {
console.log("ann_indexes.ts: start");
// --8<-- [start:ingest] // --8<-- [start:ingest]
const db = await vectordb.connect("data/sample-lancedb"); const db = await vectordb.connect("data/sample-lancedb");
@@ -49,5 +50,5 @@ import * as vectordb from "vectordb";
.execute(); .execute();
// --8<-- [end:search3] // --8<-- [end:search3]
console.log("Ann indexes: done"); console.log("ann_indexes.ts: done");
})(); })();

docs/src/assets/maxsim.png Normal file (binary, 10 KiB; not shown)


@@ -1,4 +1,4 @@
# Quick start # Basic Usage
!!! info "LanceDB can be run in a number of ways:" !!! info "LanceDB can be run in a number of ways:"
@@ -133,11 +133,20 @@ recommend switching to stable releases.
## Connect to a database ## Connect to a database
=== "Python" === "Python"
=== "Sync API"
```python ```python
--8<-- "python/python/tests/docs/test_basic.py:imports" --8<-- "python/python/tests/docs/test_basic.py:imports"
--8<-- "python/python/tests/docs/test_basic.py:connect"
--8<-- "python/python/tests/docs/test_basic.py:set_uri"
--8<-- "python/python/tests/docs/test_basic.py:connect"
```
=== "Async API"
```python
--8<-- "python/python/tests/docs/test_basic.py:imports"
--8<-- "python/python/tests/docs/test_basic.py:set_uri"
--8<-- "python/python/tests/docs/test_basic.py:connect_async" --8<-- "python/python/tests/docs/test_basic.py:connect_async"
``` ```
@@ -183,19 +192,31 @@ table.
=== "Python" === "Python"
```python
--8<-- "python/python/tests/docs/test_basic.py:create_table"
--8<-- "python/python/tests/docs/test_basic.py:create_table_async"
```
If the table already exists, LanceDB will raise an error by default. If the table already exists, LanceDB will raise an error by default.
If you want to overwrite the table, you can pass in `mode="overwrite"` If you want to overwrite the table, you can pass in `mode="overwrite"`
to the `create_table` method. to the `create_table` method.
=== "Sync API"
```python
--8<-- "python/python/tests/docs/test_basic.py:create_table"
```
You can also pass in a pandas DataFrame directly: You can also pass in a pandas DataFrame directly:
```python ```python
--8<-- "python/python/tests/docs/test_basic.py:create_table_pandas" --8<-- "python/python/tests/docs/test_basic.py:create_table_pandas"
```
=== "Async API"
```python
--8<-- "python/python/tests/docs/test_basic.py:create_table_async"
```
You can also pass in a pandas DataFrame directly:
```python
--8<-- "python/python/tests/docs/test_basic.py:create_table_async_pandas" --8<-- "python/python/tests/docs/test_basic.py:create_table_async_pandas"
``` ```
@@ -247,8 +268,14 @@ similar to a `CREATE TABLE` statement in SQL.
=== "Python" === "Python"
=== "Sync API"
```python ```python
--8<-- "python/python/tests/docs/test_basic.py:create_empty_table" --8<-- "python/python/tests/docs/test_basic.py:create_empty_table"
```
=== "Async API"
```python
--8<-- "python/python/tests/docs/test_basic.py:create_empty_table_async" --8<-- "python/python/tests/docs/test_basic.py:create_empty_table_async"
``` ```
@@ -281,8 +308,14 @@ Once created, you can open a table as follows:
=== "Python" === "Python"
=== "Sync API"
```python ```python
--8<-- "python/python/tests/docs/test_basic.py:open_table" --8<-- "python/python/tests/docs/test_basic.py:open_table"
```
=== "Async API"
```python
--8<-- "python/python/tests/docs/test_basic.py:open_table_async" --8<-- "python/python/tests/docs/test_basic.py:open_table_async"
``` ```
@@ -310,8 +343,14 @@ If you forget the name of your table, you can always get a listing of all table
=== "Python" === "Python"
=== "Sync API"
```python ```python
--8<-- "python/python/tests/docs/test_basic.py:table_names" --8<-- "python/python/tests/docs/test_basic.py:table_names"
```
=== "Async API"
```python
--8<-- "python/python/tests/docs/test_basic.py:table_names_async" --8<-- "python/python/tests/docs/test_basic.py:table_names_async"
``` ```
@@ -340,8 +379,14 @@ After a table has been created, you can always add more data to it as follows:
=== "Python" === "Python"
=== "Sync API"
```python ```python
--8<-- "python/python/tests/docs/test_basic.py:add_data" --8<-- "python/python/tests/docs/test_basic.py:add_data"
```
=== "Async API"
```python
--8<-- "python/python/tests/docs/test_basic.py:add_data_async" --8<-- "python/python/tests/docs/test_basic.py:add_data_async"
``` ```
@@ -370,8 +415,14 @@ Once you've embedded the query, you can find its nearest neighbors as follows:
=== "Python" === "Python"
=== "Sync API"
```python ```python
--8<-- "python/python/tests/docs/test_basic.py:vector_search" --8<-- "python/python/tests/docs/test_basic.py:vector_search"
```
=== "Async API"
```python
--8<-- "python/python/tests/docs/test_basic.py:vector_search_async" --8<-- "python/python/tests/docs/test_basic.py:vector_search_async"
``` ```
@@ -412,8 +463,14 @@ LanceDB allows you to create an ANN index on a table as follows:
=== "Python" === "Python"
```py === "Sync API"
```python
--8<-- "python/python/tests/docs/test_basic.py:create_index" --8<-- "python/python/tests/docs/test_basic.py:create_index"
```
=== "Async API"
```python
--8<-- "python/python/tests/docs/test_basic.py:create_index_async" --8<-- "python/python/tests/docs/test_basic.py:create_index_async"
``` ```
@@ -451,8 +508,14 @@ This can delete any number of rows that match the filter.
=== "Python" === "Python"
=== "Sync API"
```python ```python
--8<-- "python/python/tests/docs/test_basic.py:delete_rows" --8<-- "python/python/tests/docs/test_basic.py:delete_rows"
```
=== "Async API"
```python
--8<-- "python/python/tests/docs/test_basic.py:delete_rows_async" --8<-- "python/python/tests/docs/test_basic.py:delete_rows_async"
``` ```
@@ -483,7 +546,10 @@ simple or complex as needed. To see what expressions are supported, see the
=== "Python" === "Python"
=== "Sync API"
Read more: [lancedb.table.Table.delete][] Read more: [lancedb.table.Table.delete][]
=== "Async API"
Read more: [lancedb.table.AsyncTable.delete][]
=== "Typescript[^1]" === "Typescript[^1]"
@@ -505,8 +571,14 @@ Use the `drop_table()` method on the database to remove a table.
=== "Python" === "Python"
=== "Sync API"
```python ```python
--8<-- "python/python/tests/docs/test_basic.py:drop_table" --8<-- "python/python/tests/docs/test_basic.py:drop_table"
```
=== "Async API"
```python
--8<-- "python/python/tests/docs/test_basic.py:drop_table_async" --8<-- "python/python/tests/docs/test_basic.py:drop_table_async"
``` ```
@@ -543,10 +615,17 @@ You can use the embedding API when working with embedding models. It automatical
=== "Python" === "Python"
=== "Sync API"
```python ```python
--8<-- "python/python/tests/docs/test_embeddings_optional.py:imports" --8<-- "python/python/tests/docs/test_embeddings_optional.py:imports"
--8<-- "python/python/tests/docs/test_embeddings_optional.py:openai_embeddings" --8<-- "python/python/tests/docs/test_embeddings_optional.py:openai_embeddings"
``` ```
=== "Async API"
Coming soon to the async API.
https://github.com/lancedb/lancedb/issues/1938
=== "Typescript[^1]" === "Typescript[^1]"

View File

@@ -107,7 +107,6 @@ const example = async () => {
// --8<-- [start:search] // --8<-- [start:search]
const query = await tbl.search([100, 100]).limit(2).execute(); const query = await tbl.search([100, 100]).limit(2).execute();
// --8<-- [end:search] // --8<-- [end:search]
console.log(query);
// --8<-- [start:delete] // --8<-- [start:delete]
await tbl.delete('item = "fizz"'); await tbl.delete('item = "fizz"');
@@ -119,8 +118,9 @@ const example = async () => {
}; };
async function main() { async function main() {
console.log("basic_legacy.ts: start");
await example(); await example();
console.log("Basic example: done"); console.log("basic_legacy.ts: done");
} }
main(); main();

View File

@@ -2,7 +2,7 @@
LanceDB Cloud is a SaaS (software-as-a-service) solution that runs serverless in the cloud, clearly separating storage from compute. It's designed to be highly scalable without breaking the bank. LanceDB Cloud is currently in private beta with general availability coming soon, but you can apply for early access with the private beta release by signing up below. LanceDB Cloud is a SaaS (software-as-a-service) solution that runs serverless in the cloud, clearly separating storage from compute. It's designed to be highly scalable without breaking the bank. LanceDB Cloud is currently in private beta with general availability coming soon, but you can apply for early access with the private beta release by signing up below.
[Try out LanceDB Cloud](https://noteforms.com/forms/lancedb-mailing-list-cloud-kty1o5?notionforms=1&utm_source=notionforms){ .md-button .md-button--primary } [Try out LanceDB Cloud (Public Beta)](https://cloud.lancedb.com){ .md-button .md-button--primary }
## Architecture ## Architecture

View File

@@ -7,7 +7,7 @@ Approximate Nearest Neighbor (ANN) search is a method for finding data points ne
There are three main types of ANN search algorithms: There are three main types of ANN search algorithms:
* **Tree-based search algorithms**: Use a tree structure to organize and store data points. * **Tree-based search algorithms**: Use a tree structure to organize and store data points.
* * **Hash-based search algorithms**: Use a specialized geometric hash table to store and manage data points. These algorithms typically focus on theoretical guarantees, and don't usually perform as well as the other approaches in practice. * **Hash-based search algorithms**: Use a specialized geometric hash table to store and manage data points. These algorithms typically focus on theoretical guarantees, and don't usually perform as well as the other approaches in practice.
* **Graph-based search algorithms**: Use a graph structure to store data points, which can be a bit complex. * **Graph-based search algorithms**: Use a graph structure to store data points, which can be a bit complex.
HNSW is a graph-based algorithm. All graph-based search algorithms rely on the idea of a k-nearest neighbor (or k-approximate nearest neighbor) graph, which we outline below. HNSW is a graph-based algorithm. All graph-based search algorithms rely on the idea of a k-nearest neighbor (or k-approximate nearest neighbor) graph, which we outline below.
@@ -59,7 +59,7 @@ Then the greedy search routine operates as follows:
There are three key parameters to set when constructing an HNSW index: There are three key parameters to set when constructing an HNSW index:
* `metric`: Use an `L2` euclidean distance metric. We also support `dot` and `cosine` distance. * `metric`: Use an `l2` euclidean distance metric. We also support `dot` and `cosine` distance.
* `m`: The number of neighbors to select for each vector in the HNSW graph. * `m`: The number of neighbors to select for each vector in the HNSW graph.
* `ef_construction`: The number of candidates to evaluate during the construction of the HNSW graph. * `ef_construction`: The number of candidates to evaluate during the construction of the HNSW graph.

View File

@@ -47,7 +47,7 @@ We can combine the above concepts to understand how to build and query an IVF-PQ
There are three key parameters to set when constructing an IVF-PQ index: There are three key parameters to set when constructing an IVF-PQ index:
* `metric`: Use an `L2` euclidean distance metric. We also support `dot` and `cosine` distance. * `metric`: Use an `l2` euclidean distance metric. We also support `dot` and `cosine` distance.
* `num_partitions`: The number of partitions in the IVF portion of the index. * `num_partitions`: The number of partitions in the IVF portion of the index.
* `num_sub_vectors`: The number of sub-vectors that will be created during Product Quantization (PQ). * `num_sub_vectors`: The number of sub-vectors that will be created during Product Quantization (PQ).
@@ -56,7 +56,7 @@ In Python, the index can be created as follows:
```python ```python
# Create and train the index for a 1536-dimensional vector # Create and train the index for a 1536-dimensional vector
# Make sure you have enough data in the table for an effective training step # Make sure you have enough data in the table for an effective training step
tbl.create_index(metric="L2", num_partitions=256, num_sub_vectors=96) tbl.create_index(metric="l2", num_partitions=256, num_sub_vectors=96)
``` ```
!!! note !!! note
`num_partitions`=256 and `num_sub_vectors`=96 do not work for every dataset. These values need to be adjusted for your particular dataset. `num_partitions`=256 and `num_sub_vectors`=96 do not work for every dataset. These values need to be adjusted for your particular dataset.
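To see why the values depend on the data, it can help to work through the arithmetic for the 1536-dimensional example above. The sketch below is plain Python and not part of the LanceDB API: the helper name `pq_layout` is hypothetical, and it assumes the usual PQ convention that a vector is split into `num_sub_vectors` equal chunks (so the dimension must divide evenly) and that each chunk is stored as one 8-bit code.

```python
def pq_layout(dim: int, num_sub_vectors: int, bits_per_code: int = 8) -> dict:
    """Hypothetical helper: describe how PQ would split a vector of size `dim`."""
    if dim % num_sub_vectors != 0:
        raise ValueError(f"{dim} is not divisible by {num_sub_vectors}")
    return {
        # each sub-vector covers this many of the original dimensions
        "sub_vector_dim": dim // num_sub_vectors,
        # one code per sub-vector (assumed 8 bits each)
        "compressed_bytes": num_sub_vectors * bits_per_code // 8,
        # size of the uncompressed float32 vector, for comparison
        "raw_float32_bytes": dim * 4,
    }


layout = pq_layout(1536, 96)
print(layout)
```

Under these assumptions, a different embedding dimension would need a different `num_sub_vectors` that divides it evenly.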

View File

@@ -55,6 +55,14 @@ Let's implement `SentenceTransformerEmbeddings` class. All you need to do is imp
This is a stripped down version of our implementation of `SentenceTransformerEmbeddings` that removes certain optimizations and default settings. This is a stripped down version of our implementation of `SentenceTransformerEmbeddings` that removes certain optimizations and default settings.
!!! danger "Use sensitive keys to prevent leaking secrets"
To prevent leaking secrets, such as API keys, you should add any sensitive
parameters of an embedding function to the output of the
[sensitive_keys()][lancedb.embeddings.base.EmbeddingFunction.sensitive_keys] /
[getSensitiveKeys()](../../js/namespaces/embedding/classes/EmbeddingFunction/#getsensitivekeys)
method. This prevents users from accidentally instantiating the embedding
function with hard-coded secrets.
Now you can use this embedding function to create your table schema and that's it! you can then ingest data and run queries without manually vectorizing the inputs. Now you can use this embedding function to create your table schema and that's it! you can then ingest data and run queries without manually vectorizing the inputs.
=== "Python" === "Python"

View File

@@ -54,7 +54,7 @@ As mentioned, after creating embedding, each data point is represented as a vect
Points that are close to each other in vector space are considered similar (or appear in similar contexts), and points that are far away are considered dissimilar. To quantify this closeness, we use distance as a metric which can be measured in the following way - Points that are close to each other in vector space are considered similar (or appear in similar contexts), and points that are far away are considered dissimilar. To quantify this closeness, we use distance as a metric which can be measured in the following way -
1. **Euclidean Distance (L2)**: It calculates the straight-line distance between two points (vectors) in a multidimensional space. 1. **Euclidean Distance (l2)**: It calculates the straight-line distance between two points (vectors) in a multidimensional space.
2. **Cosine Similarity**: It measures the cosine of the angle between two vectors, providing a normalized measure of similarity based on their direction. 2. **Cosine Similarity**: It measures the cosine of the angle between two vectors, providing a normalized measure of similarity based on their direction.
3. **Dot product**: It is calculated as the sum of the products of their corresponding components. To measure relatedness it considers both the magnitude and direction of the vectors. 3. **Dot product**: It is calculated as the sum of the products of their corresponding components. To measure relatedness it considers both the magnitude and direction of the vectors.
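As a rough illustration of the three measures listed above, here is a minimal pure-Python sketch. The function names are illustrative only and are not LanceDB APIs; in practice the database computes these distances for you.

```python
import math


def l2_distance(a, b):
    # straight-line distance between the two points
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dot_product(a, b):
    # sum of the products of corresponding components
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    # dot product divided by the product of the vector lengths
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    return dot_product(a, b) / (norm_a * norm_b)


a = [1.0, 0.0]
b = [2.0, 0.0]  # same direction as a, twice the length
c = [0.0, 1.0]  # orthogonal to a

for name, other in (("b", b), ("c", c)):
    print(name, l2_distance(a, other), dot_product(a, other), cosine_similarity(a, other))
```

Note how `a` and `b` have a cosine similarity of 1 even though their L2 distance is not zero: cosine ignores magnitude, while L2 and dot product do not.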

View File

@@ -0,0 +1,53 @@
# Variables and Secrets
Most embedding configuration options are saved in the table's metadata. However,
this isn't always appropriate. For example, API keys should never be stored in the
metadata. Additionally, other configuration options might be best set at runtime,
such as the `device` configuration that controls whether to use GPU or CPU for
inference. If you hardcoded this to GPU, you wouldn't be able to run the code on
a server without one.
To handle these cases, you can set variables on the embedding registry and
reference them in the embedding configuration. These variables will be available
during the runtime of your program, but not saved in the table's metadata. When
the table is loaded from a different process, the variables must be set again.
To set a variable, use the `set_var()` / `setVar()` method on the embedding registry.
To reference a variable, use the syntax `$env:VARIABLE_NAME`. If there is a default
value, you can use the syntax `$env:VARIABLE_NAME:DEFAULT_VALUE`.
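To make the reference syntax concrete, the following is a small sketch of how such references could be resolved. It is not LanceDB's implementation: the `resolve` function, the `variables` dictionary, and the placeholder key value are all hypothetical and only mirror the two forms described above.

```python
def resolve(value: str, variables: dict) -> str:
    """Hypothetical resolver for `$env:NAME` and `$env:NAME:DEFAULT` references."""
    prefix = "$env:"
    if not value.startswith(prefix):
        return value  # a plain literal is returned unchanged
    name, sep, default = value[len(prefix):].partition(":")
    if name in variables:
        return variables[name]
    if sep:  # a default was supplied after the second colon
        return default
    raise KeyError(f"variable {name!r} is not set and has no default")


# stands in for variables set at runtime; the key is a placeholder, not a real secret
variables = {"api_key": "sk-example"}

print(resolve("$env:api_key", variables))
print(resolve("$env:device:cpu", variables))
```

The second call falls back to the default because `device` was never set, which matches the behaviour you would want when the same table is opened on a machine without a GPU.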
## Using variables to set secrets
Sensitive configuration, such as API keys, must either be set as environment
variables or using variables on the embedding registry. If you pass in a hardcoded
value, LanceDB will raise an error. Instead, if you want to set an API key via
configuration, use a variable:
=== "Python"
```python
--8<-- "python/python/tests/docs/test_embeddings_optional.py:register_secret"
```
=== "Typescript"
```typescript
--8<-- "nodejs/examples/embedding.test.ts:register_secret"
```
## Using variables to set the device parameter
Many embedding functions that run locally have a `device` parameter that controls
whether to use GPU or CPU for inference. Because not all computers have a GPU,
it's helpful to be able to set the `device` parameter at runtime, rather than
have it hard coded in the embedding configuration. To make it work even if the
variable isn't set, you could provide a default value of `cpu` in the embedding
configuration.
Some embedding libraries even have a method to detect which devices are available,
which could be used to dynamically set the device at runtime. For example, in Python
you can check if a CUDA GPU is available using `torch.cuda.is_available()`.
```python
--8<-- "python/python/tests/docs/test_embeddings_optional.py:register_device"
```

View File

@@ -8,15 +8,5 @@ LanceDB provides language APIs, allowing you to embed a database in your languag
* 👾 [JavaScript](examples_js.md) examples * 👾 [JavaScript](examples_js.md) examples
* 🦀 Rust examples (coming soon) * 🦀 Rust examples (coming soon)
## Python Applications powered by LanceDB !!! tip "Hosted LanceDB"
If you want S3 cost-efficiency and local performance via a simple serverless API, checkout **LanceDB Cloud**. For private deployments, high performance at extreme scale, or if you have strict security requirements, talk to us about **LanceDB Enterprise**. [Learn more](https://docs.lancedb.com/)
| Project Name | Description |
| --- | --- |
| **Ultralytics Explorer 🚀**<br>[![Ultralytics](https://img.shields.io/badge/Ultralytics-Docs-green?labelColor=0f3bc4&style=flat-square&logo=https://cdn.prod.website-files.com/646dd1f1a3703e451ba81ecc/64994922cf2a6385a4bf4489_UltralyticsYOLO_mark_blue.svg&link=https://docs.ultralytics.com/datasets/explorer/)](https://docs.ultralytics.com/datasets/explorer/)<br>[![Open In Collab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ultralytics/ultralytics/blob/main/docs/en/datasets/explorer/explorer.ipynb) | - 🔍 **Explore CV Datasets**: Semantic search, SQL queries, vector similarity, natural language.<br>- 🖥️ **GUI & Python API**: Seamless dataset interaction.<br>- ⚡ **Efficient & Scalable**: Leverages LanceDB for large datasets.<br>- 📊 **Detailed Analysis**: Easily analyze data patterns.<br>- 🌐 **Browser GUI Demo**: Create embeddings, search images, run queries. |
| **Website Chatbot🤖**<br>[![GitHub](https://img.shields.io/badge/github-%23121011.svg?style=for-the-badge&logo=github&logoColor=white)](https://github.com/lancedb/lancedb-vercel-chatbot)<br>[![Deploy with Vercel](https://vercel.com/button)](https://vercel.com/new/clone?repository-url=https%3A%2F%2Fgithub.com%2Flancedb%2Flancedb-vercel-chatbot&amp;env=OPENAI_API_KEY&amp;envDescription=OpenAI%20API%20Key%20for%20chat%20completion.&amp;project-name=lancedb-vercel-chatbot&amp;repository-name=lancedb-vercel-chatbot&amp;demo-title=LanceDB%20Chatbot%20Demo&amp;demo-description=Demo%20website%20chatbot%20with%20LanceDB.&amp;demo-url=https%3A%2F%2Flancedb.vercel.app&amp;demo-image=https%3A%2F%2Fi.imgur.com%2FazVJtvr.png) | - 🌐 **Chatbot from Sitemap/Docs**: Create a chatbot using site or document context.<br>- 🚀 **Embed LanceDB in Next.js**: Lightweight, on-prem storage.<br>- 🧠 **AI-Powered Context Retrieval**: Efficiently access relevant data.<br>- 🔧 **Serverless & Native JS**: Seamless integration with Next.js.<br>- ⚡ **One-Click Deploy on Vercel**: Quick and easy setup.. |
## Nodejs Applications powered by LanceDB
| Project Name | Description |
| --- | --- |
| **Langchain Writing Assistant✍ **<br>[![Github](../assets/github.svg)](https://github.com/lancedb/vectordb-recipes/tree/main/applications/node/lanchain_writing_assistant) | - **📂 Data Source Integration**: Use your own data by specifying data source file, and the app instantly processes it to provide insights. <br>- **🧠 Intelligent Suggestions**: Powered by LangChain.js and LanceDB, it improves writing productivity and accuracy. <br>- **💡 Enhanced Writing Experience**: It delivers real-time contextual insights and factual suggestions while the user writes. |

1
docs/src/extra_js/reo.js Normal file
View File

@@ -0,0 +1 @@
!function(){var e,t,n;e="9627b71b382d201",t=function(){Reo.init({clientID:"9627b71b382d201"})},(n=document.createElement("script")).src="https://static.reo.dev/"+e+"/reo.js",n.defer=!0,n.onload=t,document.head.appendChild(n)}();

View File

@@ -10,27 +10,19 @@ LanceDB provides support for full-text search via Lance, allowing you to incorpo
Consider that we have a LanceDB table named `my_table`, whose string column `text` we want to index and query via keyword search, the FTS index must be created before you can search via keywords. Consider that we have a LanceDB table named `my_table`, whose string column `text` we want to index and query via keyword search, the FTS index must be created before you can search via keywords.
=== "Python" === "Python"
=== "Sync API"
```python ```python
import lancedb --8<-- "python/python/tests/docs/test_search.py:import-lancedb"
--8<-- "python/python/tests/docs/test_search.py:import-lancedb-fts"
--8<-- "python/python/tests/docs/test_search.py:basic_fts"
```
=== "Async API"
uri = "data/sample-lancedb" ```python
db = lancedb.connect(uri) --8<-- "python/python/tests/docs/test_search.py:import-lancedb"
--8<-- "python/python/tests/docs/test_search.py:import-lancedb-fts"
table = db.create_table( --8<-- "python/python/tests/docs/test_search.py:basic_fts_async"
"my_table",
data=[
{"vector": [3.1, 4.1], "text": "Frodo was a happy puppy"},
{"vector": [5.9, 26.5], "text": "There are several kittens playing"},
],
)
# passing `use_tantivy=False` to use lance FTS index
# `use_tantivy=True` by default
table.create_fts_index("text", use_tantivy=False)
table.search("puppy").limit(10).select(["text"]).to_list()
# [{'text': 'Frodo was a happy puppy', '_score': 0.6931471824645996}]
# ...
``` ```
=== "TypeScript" === "TypeScript"
@@ -50,7 +42,7 @@ Consider that we have a LanceDB table named `my_table`, whose string column `tex
}); });
await tbl await tbl
.search("puppy", queryType="fts") .search("puppy", "fts")
.select(["text"]) .select(["text"])
.limit(10) .limit(10)
.toArray(); .toArray();
@@ -93,8 +85,15 @@ By default the text is tokenized by splitting on punctuation and whitespaces, an
Stemming is useful for improving search results by reducing words to their root form, e.g. "running" to "run". LanceDB supports stemming for multiple languages, you can specify the tokenizer name to enable stemming by the pattern `tokenizer_name="{language_code}_stem"`, e.g. `en_stem` for English. Stemming is useful for improving search results by reducing words to their root form, e.g. "running" to "run". LanceDB supports stemming for multiple languages, you can specify the tokenizer name to enable stemming by the pattern `tokenizer_name="{language_code}_stem"`, e.g. `en_stem` for English.
For example, to enable stemming for English: For example, to enable stemming for English:
=== "Sync API"
```python ```python
table.create_fts_index("text", use_tantivy=True, tokenizer_name="en_stem") --8<-- "python/python/tests/docs/test_search.py:fts_config_stem"
```
=== "Async API"
```python
--8<-- "python/python/tests/docs/test_search.py:fts_config_stem_async"
``` ```
the following [languages](https://docs.rs/tantivy/latest/tantivy/tokenizer/enum.Language.html) are currently supported. the following [languages](https://docs.rs/tantivy/latest/tantivy/tokenizer/enum.Language.html) are currently supported.
@@ -102,12 +101,15 @@ the following [languages](https://docs.rs/tantivy/latest/tantivy/tokenizer/enum.
The tokenizer is customizable, you can specify how the tokenizer splits the text, and how it filters out words, etc. The tokenizer is customizable, you can specify how the tokenizer splits the text, and how it filters out words, etc.
For example, for language with accents, you can specify the tokenizer to use `ascii_folding` to remove accents, e.g. 'é' to 'e': For example, for language with accents, you can specify the tokenizer to use `ascii_folding` to remove accents, e.g. 'é' to 'e':
=== "Sync API"
```python ```python
table.create_fts_index("text", --8<-- "python/python/tests/docs/test_search.py:fts_config_folding"
use_tantivy=False, ```
language="French", === "Async API"
stem=True,
ascii_folding=True) ```python
--8<-- "python/python/tests/docs/test_search.py:fts_config_folding_async"
``` ```
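For intuition about what ASCII folding does to the text before it is indexed, here is a minimal standard-library sketch. It is only an approximation of the idea and is not the tokenizer's actual implementation; the function name `ascii_fold` is hypothetical, and a real folding filter typically handles more characters than this.

```python
import unicodedata


def ascii_fold(text: str) -> str:
    """Illustrative only: drop combining accent marks after Unicode decomposition."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(ascii_fold("café crème"))  # prints "cafe creme"
```

With folding applied at both index and query time, a search for `creme` can match a document containing `crème`.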
## Filtering ## Filtering
@@ -119,8 +121,15 @@ This can be invoked via the familiar `where` syntax.
With pre-filtering: With pre-filtering:
=== "Python" === "Python"
=== "Sync API"
```python ```python
table.search("puppy").limit(10).where("meta='foo'", prefilte=True).to_list() --8<-- "python/python/tests/docs/test_search.py:fts_prefiltering"
```
=== "Async API"
```python
--8<-- "python/python/tests/docs/test_search.py:fts_prefiltering_async"
``` ```
=== "TypeScript" === "TypeScript"
@@ -151,8 +160,15 @@ With pre-filtering:
With post-filtering: With post-filtering:
=== "Python" === "Python"
=== "Sync API"
```python ```python
table.search("puppy").limit(10).where("meta='foo'", prefilte=False).to_list() --8<-- "python/python/tests/docs/test_search.py:fts_postfiltering"
```
=== "Async API"
```python
--8<-- "python/python/tests/docs/test_search.py:fts_postfiltering_async"
``` ```
=== "TypeScript" === "TypeScript"
@@ -191,8 +207,15 @@ or a **terms** search query like `old man sea`. For more details on the terms
query syntax, see Tantivy's [query parser rules](https://docs.rs/tantivy/latest/tantivy/query/struct.QueryParser.html). query syntax, see Tantivy's [query parser rules](https://docs.rs/tantivy/latest/tantivy/query/struct.QueryParser.html).
To search for a phrase, the index must be created with `with_position=True`: To search for a phrase, the index must be created with `with_position=True`:
=== "Sync API"
```python ```python
table.create_fts_index("text", use_tantivy=False, with_position=True) --8<-- "python/python/tests/docs/test_search.py:fts_with_position"
```
=== "Async API"
```python
--8<-- "python/python/tests/docs/test_search.py:fts_with_position_async"
``` ```
This will allow you to search for phrases, but it will also significantly increase the index size and indexing time. This will allow you to search for phrases, but it will also significantly increase the index size and indexing time.
@@ -205,9 +228,15 @@ This can make the query more efficient, especially when the table is large and t
=== "Python" === "Python"
=== "Sync API"
```python ```python
table.add([{"vector": [3.1, 4.1], "text": "Frodo was a happy puppy"}]) --8<-- "python/python/tests/docs/test_search.py:fts_incremental_index"
table.optimize() ```
=== "Async API"
```python
--8<-- "python/python/tests/docs/test_search.py:fts_incremental_index_async"
``` ```
=== "TypeScript" === "TypeScript"

View File

@@ -2,7 +2,7 @@
LanceDB also provides support for full-text search via [Tantivy](https://github.com/quickwit-oss/tantivy), allowing you to incorporate keyword-based search (based on BM25) in your retrieval solutions. LanceDB also provides support for full-text search via [Tantivy](https://github.com/quickwit-oss/tantivy), allowing you to incorporate keyword-based search (based on BM25) in your retrieval solutions.
The tantivy-based FTS is only available in Python and does not support building indexes on object storage or incremental indexing. If you need these features, try native FTS [native FTS](fts.md). The tantivy-based FTS is only available in Python synchronous APIs and does not support building indexes on object storage or incremental indexing. If you need these features, try native FTS [native FTS](fts.md).
## Installation ## Installation

View File

@@ -0,0 +1,85 @@
# Late interaction & MultiVector embedding type
Late interaction is a retrieval technique that calculates the relevance of a query to a document by comparing their multi-vector representations. The figure below shows the key difference between late interaction and other popular methods:
![late interaction vs other methods](https://raw.githubusercontent.com/lancedb/assets/b035a0ceb2c237734e0d393054c146d289792339/docs/assets/integration/colbert-blog-interaction.svg)
[ Illustration from https://jina.ai/news/what-is-colbert-and-late-interaction-and-why-they-matter-in-search/]
<b>No interaction:</b> Refers to embedding the query and document independently; the two embeddings are then compared to calculate similarity without any interaction between them. This is typically used in vector search operations.
<b>Partial interaction:</b> Refers to a specific approach where the similarity computation happens primarily between query vectors and document vectors, without extensive interaction between individual components of each. An example of this is dual-encoder models like BERT.
<b>Early full interaction:</b> Refers to techniques like cross-encoders that process the query and documents in pairs, with full interaction across various stages of encoding. This is a powerful but relatively slow technique. Because it requires processing the query and documents in pairs, document embeddings can't be pre-computed for fast retrieval. This is why cross-encoders are typically used as reranking models combined with vector search. Learn more about [LanceDB Reranking support](https://lancedb.github.io/lancedb/reranking/).
<b>Late interaction:</b> Refers to encoding the document and query independently, with the interaction or evaluation happening during the retrieval process. This is typically used in retrieval models like ColBERT. Unlike early interaction, it speeds up the retrieval process without compromising the depth of semantic analysis.
## Internals of ColBERT
Let's take a look at the steps involved in performing late interaction based retrieval using ColBERT:
• ColBERT employs BERT-based encoders for both queries `(fQ)` and documents `(fD)`
• A single BERT model is shared between query and document encoders and special tokens distinguish input types: `[Q]` for queries and `[D]` for documents
**Query Encoder (fQ):**
• Query q is tokenized into WordPiece tokens: `q1, q2, ..., ql`. `[Q]` token is prepended right after BERT's `[CLS]` token
• If query length < Nq, it's padded with `[MASK]` tokens up to Nq
• The padded sequence goes through BERT's transformer architecture
• Final embeddings are L2-normalized
**Document Encoder (fD):**
• Document d is tokenized into tokens `d1, d2, ..., dm`. `[D]` token is prepended after `[CLS]` token
• Unlike queries, documents are NOT padded with `[MASK]` tokens
• Document tokens are processed through BERT and the same linear layer
**Late Interaction:**
• Late interaction estimates relevance score `S(q,d)` using embedding `Eq` and `Ed`. Late interaction happens after independent encoding
• For each query embedding, maximum similarity is computed against all document embeddings
• The similarity measure can be cosine similarity or squared L2 distance
**MaxSim Calculation:**
```
S(q,d) := Σ_{i∈|Eq|} max_{j∈|Ed|} (Eqi ⋅ Edj^T)
```
• This finds the best matching document embedding for each query embedding
• Captures relevance based on strongest local matches between contextual embeddings
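The MaxSim formula above can be sketched in a few lines of pure Python. This is an illustrative toy, not ColBERT's or LanceDB's implementation: the function names and the tiny two-dimensional embeddings are made up for the example, and the vectors are chosen to be unit length so that the dot product equals cosine similarity.

```python
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def maxsim(query_embeddings, doc_embeddings):
    """S(q, d): for each query vector take its best match among the document vectors, then sum."""
    return sum(
        max(dot(q, d) for d in doc_embeddings)
        for q in query_embeddings
    )


# toy token embeddings (unit length), one row per token
Eq = [[1.0, 0.0], [0.0, 1.0]]
Ed = [[1.0, 0.0], [0.6, 0.8], [0.0, -1.0]]

score = maxsim(Eq, Ed)
print(score)
```

Here the first query token matches the first document token exactly (similarity 1.0), while the second query token's best match is the second document token (similarity 0.8), so the document scores about 1.8.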
## LanceDB MultiVector type
LanceDB supports a multivector type, which is useful when you have multiple vectors for a single item (e.g. with ColBERT and ColPali).
You can index a column with the multivector type and search on it; the query can be a single vector or multiple vectors. For now, only the cosine metric is supported for multivector search. The vector value type can be float16, float32, or float64. LanceDB integrates [ConteXtualized Token Retriever (XTR)](https://arxiv.org/abs/2304.01982), which introduces a simple yet novel objective function that encourages the model to retrieve the most important document tokens first.
```python
import lancedb
import numpy as np
import pyarrow as pa

db = lancedb.connect("data/multivector_demo")
schema = pa.schema(
    [
        pa.field("id", pa.int64()),
        # float16, float32, and float64 are supported
        pa.field("vector", pa.list_(pa.list_(pa.float32(), 256))),
    ]
)
data = [
    {
        "id": i,
        "vector": np.random.random(size=(2, 256)).tolist(),
    }
    for i in range(1024)
]
tbl = db.create_table("my_table", data=data, schema=schema)

# only cosine similarity is supported for multi-vectors
tbl.create_index(metric="cosine")

# query with single vector
query = np.random.random(256).astype(np.float16)
tbl.search(query).to_arrow()

# query with multiple vectors
query = np.random.random(size=(2, 256))
tbl.search(query).to_arrow()
```
Find more about vector search in LanceDB [here](https://lancedb.github.io/lancedb/search/#multivector-type).

View File

@@ -32,18 +32,19 @@ over scalar columns.
### Create a scalar index ### Create a scalar index
=== "Python" === "Python"
```python === "Sync API"
import lancedb
books = [
{"book_id": 1, "publisher": "plenty of books", "tags": ["fantasy", "adventure"]},
{"book_id": 2, "publisher": "book town", "tags": ["non-fiction"]},
{"book_id": 3, "publisher": "oreilly", "tags": ["textbook"]}
]
db = lancedb.connect("./db") ```python
table = db.create_table("books", books) --8<-- "python/python/tests/docs/test_guide_index.py:import-lancedb"
table.create_scalar_index("book_id") # BTree by default --8<-- "python/python/tests/docs/test_guide_index.py:import-lancedb-btree-bitmap"
table.create_scalar_index("publisher", index_type="BITMAP") --8<-- "python/python/tests/docs/test_guide_index.py:basic_scalar_index"
```
=== "Async API"
```python
--8<-- "python/python/tests/docs/test_guide_index.py:import-lancedb"
--8<-- "python/python/tests/docs/test_guide_index.py:import-lancedb-btree-bitmap"
--8<-- "python/python/tests/docs/test_guide_index.py:basic_scalar_index_async"
``` ```
=== "Typescript" === "Typescript"
@@ -62,11 +63,17 @@ The following scan will be faster if the column `book_id` has a scalar index:
=== "Python" === "Python"
```python === "Sync API"
import lancedb
table = db.open_table("books") ```python
my_df = table.search().where("book_id = 2").to_pandas() --8<-- "python/python/tests/docs/test_guide_index.py:import-lancedb"
--8<-- "python/python/tests/docs/test_guide_index.py:search_with_scalar_index"
```
=== "Async API"
```python
--8<-- "python/python/tests/docs/test_guide_index.py:import-lancedb"
--8<-- "python/python/tests/docs/test_guide_index.py:search_with_scalar_index_async"
``` ```
=== "Typescript" === "Typescript"
@@ -88,21 +95,17 @@ Scalar indices can also speed up scans containing a vector search or full text s
=== "Python" === "Python"
=== "Sync API"
```python ```python
import lancedb --8<-- "python/python/tests/docs/test_guide_index.py:import-lancedb"
--8<-- "python/python/tests/docs/test_guide_index.py:vector_search_with_scalar_index"
```
=== "Async API"
data = [ ```python
{"book_id": 1, "vector": [1, 2]}, --8<-- "python/python/tests/docs/test_guide_index.py:import-lancedb"
{"book_id": 2, "vector": [3, 4]}, --8<-- "python/python/tests/docs/test_guide_index.py:vector_search_with_scalar_index_async"
{"book_id": 3, "vector": [5, 6]}
]
table = db.create_table("book_with_embeddings", data)
(
table.search([1, 2])
.where("book_id != 3", prefilter=True)
.to_pandas()
)
``` ```
=== "Typescript" === "Typescript"
@@ -122,9 +125,15 @@ Scalar indices can also speed up scans containing a vector search or full text s
Updating the table data (adding, deleting, or modifying records) requires that you also update the scalar index. This can be done by calling `optimize`, which will trigger an update to the existing scalar index. Updating the table data (adding, deleting, or modifying records) requires that you also update the scalar index. This can be done by calling `optimize`, which will trigger an update to the existing scalar index.
=== "Python" === "Python"
=== "Sync API"
```python ```python
table.add([{"vector": [7, 8], "book_id": 4}]) --8<-- "python/python/tests/docs/test_guide_index.py:update_scalar_index"
table.optimize() ```
=== "Async API"
```python
--8<-- "python/python/tests/docs/test_guide_index.py:update_scalar_index_async"
``` ```
=== "TypeScript" === "TypeScript"

View File

@@ -12,26 +12,50 @@ LanceDB OSS supports object stores such as AWS S3 (and compatible stores), Azure
=== "Python" === "Python"
AWS S3: AWS S3:
=== "Sync API"
```python ```python
import lancedb import lancedb
db = lancedb.connect("s3://bucket/path") db = lancedb.connect("s3://bucket/path")
``` ```
=== "Async API"
```python
import lancedb
async_db = await lancedb.connect_async("s3://bucket/path")
```
Google Cloud Storage: Google Cloud Storage:
=== "Sync API"
```python ```python
import lancedb import lancedb
db = lancedb.connect("gs://bucket/path") db = lancedb.connect("gs://bucket/path")
``` ```
=== "Async API"
```python
import lancedb
async_db = await lancedb.connect_async("gs://bucket/path")
```
Azure Blob Storage: Azure Blob Storage:
<!-- skip-test --> <!-- skip-test -->
=== "Sync API"
```python ```python
import lancedb import lancedb
db = lancedb.connect("az://bucket/path") db = lancedb.connect("az://bucket/path")
``` ```
<!-- skip-test -->
=== "Async API"
```python
import lancedb
async_db = await lancedb.connect_async("az://bucket/path")
```
Note that for Azure, storage credentials must be configured. See [below](#azure-blob-storage) for more details. Note that for Azure, storage credentials must be configured. See [below](#azure-blob-storage) for more details.
@@ -94,9 +118,20 @@ If you only want this to apply to one particular connection, you can pass the `s
=== "Python" === "Python"
=== "Sync API"
```python ```python
import lancedb import lancedb
db = await lancedb.connect_async( db = lancedb.connect(
"s3://bucket/path",
storage_options={"timeout": "60s"}
)
```
=== "Async API"
```python
import lancedb
async_db = await lancedb.connect_async(
"s3://bucket/path", "s3://bucket/path",
storage_options={"timeout": "60s"} storage_options={"timeout": "60s"}
) )
@@ -128,10 +163,24 @@ Getting even more specific, you can set the `timeout` for only a particular tabl
=== "Python" === "Python"
<!-- skip-test --> <!-- skip-test -->
=== "Sync API"
```python ```python
import lancedb import lancedb
db = await lancedb.connect_async("s3://bucket/path") db = lancedb.connect("s3://bucket/path")
table = await db.create_table( table = db.create_table(
"table",
[{"a": 1, "b": 2}],
storage_options={"timeout": "60s"}
)
```
<!-- skip-test -->
=== "Async API"
```python
import lancedb
async_db = await lancedb.connect_async("s3://bucket/path")
async_table = await async_db.create_table(
"table", "table",
[{"a": 1, "b": 2}], [{"a": 1, "b": 2}],
storage_options={"timeout": "60s"} storage_options={"timeout": "60s"}
@@ -194,9 +243,24 @@ These can be set as environment variables or passed in the `storage_options` par
=== "Python" === "Python"
=== "Sync API"
```python ```python
import lancedb import lancedb
db = await lancedb.connect_async( db = lancedb.connect(
"s3://bucket/path",
storage_options={
"aws_access_key_id": "my-access-key",
"aws_secret_access_key": "my-secret-key",
"aws_session_token": "my-session-token",
}
)
```
=== "Async API"
```python
import lancedb
async_db = await lancedb.connect_async(
"s3://bucket/path", "s3://bucket/path",
storage_options={ storage_options={
"aws_access_key_id": "my-access-key", "aws_access_key_id": "my-access-key",
@@ -278,7 +342,7 @@ For **read and write access**, LanceDB will need a policy such as:
"Action": [ "Action": [
"s3:PutObject", "s3:PutObject",
"s3:GetObject", "s3:GetObject",
"s3:DeleteObject", "s3:DeleteObject"
], ],
"Resource": "arn:aws:s3:::<bucket>/<prefix>/*" "Resource": "arn:aws:s3:::<bucket>/<prefix>/*"
}, },
@@ -310,7 +374,7 @@ For **read-only access**, LanceDB will need a policy such as:
{ {
"Effect": "Allow", "Effect": "Allow",
"Action": [ "Action": [
"s3:GetObject", "s3:GetObject"
], ],
"Resource": "arn:aws:s3:::<bucket>/<prefix>/*" "Resource": "arn:aws:s3:::<bucket>/<prefix>/*"
}, },
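The two policy hunks above differ only in a trailing comma after the last `Action` entry. As a minimal standard-library sketch (the abbreviated policy strings are illustrative, not the full policies), a strict JSON parser rejects the old form and accepts the new one:

```python
import json

# Shape of the old `Action` list: a comma follows the final entry.
with_trailing_comma = '{"Action": ["s3:GetObject",]}'
# Shape of the new `Action` list: no comma after the final entry.
without_trailing_comma = '{"Action": ["s3:GetObject"]}'


def is_valid_json(text: str) -> bool:
    """Return True if `text` parses as strict JSON."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


print("old form valid:", is_valid_json(with_trailing_comma))
print("new form valid:", is_valid_json(without_trailing_comma))
```

This is presumably why the change matters for anyone pasting the policy into a tool that validates JSON strictly.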
@@ -348,9 +412,19 @@ name of the table to use.
=== "Python" === "Python"
=== "Sync API"
```python ```python
import lancedb import lancedb
db = await lancedb.connect_async( db = lancedb.connect(
"s3+ddb://bucket/path?ddbTableName=my-dynamodb-table",
)
```
=== "Async API"
```python
import lancedb
async_db = await lancedb.connect_async(
"s3+ddb://bucket/path?ddbTableName=my-dynamodb-table", "s3+ddb://bucket/path?ddbTableName=my-dynamodb-table",
) )
``` ```
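The `s3+ddb://` URI above carries the DynamoDB table name in a `ddbTableName` query parameter. The sketch below uses only the Python standard library to show how such a URI decomposes. The helper name `split_store_uri` is hypothetical and is not part of LanceDB, which parses the URI itself.

```python
from urllib.parse import parse_qs, urlparse


def split_store_uri(uri: str) -> dict:
    """Split an object-store URI into scheme, bucket, path and query params."""
    parsed = urlparse(uri)
    return {
        "scheme": parsed.scheme,
        "bucket": parsed.netloc,
        "path": parsed.path.lstrip("/"),
        # parse_qs returns a list per key; keep the first value of each.
        "params": {k: v[0] for k, v in parse_qs(parsed.query).items()},
    }


parts = split_store_uri("s3+ddb://bucket/path?ddbTableName=my-dynamodb-table")
print(parts)
```

A check like this can be handy for validating configuration before opening a connection.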
@@ -441,9 +515,23 @@ LanceDB can also connect to S3-compatible stores, such as MinIO. To do so, you m
=== "Python" === "Python"
=== "Sync API"
```python ```python
import lancedb import lancedb
db = await lancedb.connect_async( db = lancedb.connect(
"s3://bucket/path",
storage_options={
"region": "us-east-1",
"endpoint": "http://minio:9000",
}
)
```
=== "Async API"
```python
import lancedb
async_db = await lancedb.connect_async(
"s3://bucket/path", "s3://bucket/path",
storage_options={ storage_options={
"region": "us-east-1", "region": "us-east-1",
@@ -502,9 +590,23 @@ To configure LanceDB to use an S3 Express endpoint, you must set the storage opt
=== "Python" === "Python"
=== "Sync API"
```python ```python
import lancedb import lancedb
db = await lancedb.connect_async( db = lancedb.connect(
"s3://my-bucket--use1-az4--x-s3/path",
storage_options={
"region": "us-east-1",
"s3_express": "true",
}
)
```
=== "Async API"
```python
import lancedb
async_db = await lancedb.connect_async(
"s3://my-bucket--use1-az4--x-s3/path", "s3://my-bucket--use1-az4--x-s3/path",
storage_options={ storage_options={
"region": "us-east-1", "region": "us-east-1",
@@ -552,9 +654,23 @@ GCS credentials are configured by setting the `GOOGLE_SERVICE_ACCOUNT` environme
=== "Python" === "Python"
<!-- skip-test --> <!-- skip-test -->
=== "Sync API"
```python ```python
import lancedb import lancedb
db = await lancedb.connect_async( db = lancedb.connect(
"gs://my-bucket/my-database",
storage_options={
"service_account": "path/to/service-account.json",
}
)
```
<!-- skip-test -->
=== "Async API"
```python
import lancedb
async_db = await lancedb.connect_async(
"gs://my-bucket/my-database", "gs://my-bucket/my-database",
storage_options={ storage_options={
"service_account": "path/to/service-account.json", "service_account": "path/to/service-account.json",
@@ -612,9 +728,24 @@ Azure Blob Storage credentials can be configured by setting the `AZURE_STORAGE_A
=== "Python" === "Python"
<!-- skip-test --> <!-- skip-test -->
=== "Sync API"
```python ```python
import lancedb import lancedb
db = await lancedb.connect_async( db = lancedb.connect(
"az://my-container/my-database",
storage_options={
"account_name": "some-account",
"account_key": "some-key",
}
)
```
<!-- skip-test -->
=== "Async API"
```python
import lancedb
async_db = await lancedb.connect_async(
"az://my-container/my-database", "az://my-container/my-database",
storage_options={ storage_options={
"account_name": "some-account", "account_name": "some-account",

View File

@@ -12,9 +12,17 @@ Initialize a LanceDB connection and create a table
=== "Python" === "Python"
=== "Sync API"
```python ```python
import lancedb --8<-- "python/python/tests/docs/test_guide_tables.py:import-lancedb"
db = lancedb.connect("./.lancedb") --8<-- "python/python/tests/docs/test_guide_tables.py:connect"
```
=== "Async API"
```python
--8<-- "python/python/tests/docs/test_guide_tables.py:import-lancedb"
--8<-- "python/python/tests/docs/test_guide_tables.py:connect_async"
``` ```
LanceDB allows ingesting data from various sources - `dict`, `list[dict]`, `pd.DataFrame`, `pa.Table` or an `Iterator[pa.RecordBatch]`. Let's take a look at some of these. LanceDB allows ingesting data from various sources - `dict`, `list[dict]`, `pd.DataFrame`, `pa.Table` or an `Iterator[pa.RecordBatch]`. Let's take a look at some of these.
@@ -47,17 +55,15 @@ Initialize a LanceDB connection and create a table
=== "Python" === "Python"
=== "Sync API"
```python ```python
import lancedb --8<-- "python/python/tests/docs/test_guide_tables.py:create_table"
```
=== "Async API"
db = lancedb.connect("./.lancedb") ```python
--8<-- "python/python/tests/docs/test_guide_tables.py:create_table_async"
data = [{"vector": [1.1, 1.2], "lat": 45.5, "long": -122.7},
{"vector": [0.2, 1.8], "lat": 40.1, "long": -74.1}]
db.create_table("my_table", data)
db["my_table"].head()
``` ```
!!! info "Note" !!! info "Note"
@@ -67,15 +73,29 @@ Initialize a LanceDB connection and create a table
and the table exists, then it simply opens the existing table. The data you and the table exists, then it simply opens the existing table. The data you
passed in will NOT be appended to the table in that case. passed in will NOT be appended to the table in that case.
=== "Sync API"
```python ```python
db.create_table("name", data, exist_ok=True) --8<-- "python/python/tests/docs/test_guide_tables.py:create_table_exist_ok"
```
=== "Async API"
```python
--8<-- "python/python/tests/docs/test_guide_tables.py:create_table_async_exist_ok"
``` ```
Sometimes you want to make sure that you start fresh. If you want to Sometimes you want to make sure that you start fresh. If you want to
overwrite the table, you can pass in `mode="overwrite"` to the `create_table` function. overwrite the table, you can pass in `mode="overwrite"` to the `create_table` function.
=== "Sync API"
```python ```python
db.create_table("name", data, mode="overwrite") --8<-- "python/python/tests/docs/test_guide_tables.py:create_table_overwrite"
```
=== "Async API"
```python
--8<-- "python/python/tests/docs/test_guide_tables.py:create_table_async_overwrite"
``` ```
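The note above describes three outcomes when the table name already exists: an error by default, opening the existing table untouched with `exist_ok=True`, and replacement with `mode="overwrite"`. The toy model below illustrates only those semantics. `ToyDB` is a hypothetical in-memory stand-in rather than the LanceDB implementation, and letting `mode="overwrite"` take precedence over `exist_ok` is an assumption of this sketch.

```python
class ToyDB:
    """In-memory stand-in used only to illustrate the documented semantics."""

    def __init__(self):
        self.tables = {}

    def create_table(self, name, data, mode="create", exist_ok=False):
        if name in self.tables:
            if mode == "overwrite":
                # Start fresh: the old rows are discarded.
                self.tables[name] = list(data)
            elif not exist_ok:
                raise ValueError(f"Table {name!r} already exists")
            # With exist_ok=True the existing table is returned as-is;
            # the rows passed in are NOT appended.
            return self.tables[name]
        self.tables[name] = list(data)
        return self.tables[name]


db = ToyDB()
db.create_table("name", [{"id": 1}])
reopened = db.create_table("name", [{"id": 2}], exist_ok=True)
replaced = db.create_table("name", [{"id": 3}], mode="overwrite")
```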
=== "Typescript[^1]" === "Typescript[^1]"
@@ -146,18 +166,18 @@ Initialize a LanceDB connection and create a table
### From a Pandas DataFrame ### From a Pandas DataFrame
=== "Sync API"
```python ```python
import pandas as pd --8<-- "python/python/tests/docs/test_guide_tables.py:import-pandas"
--8<-- "python/python/tests/docs/test_guide_tables.py:create_table_from_pandas"
```
=== "Async API"
data = pd.DataFrame({ ```python
"vector": [[1.1, 1.2, 1.3, 1.4], [0.2, 1.8, 0.4, 3.6]], --8<-- "python/python/tests/docs/test_guide_tables.py:import-pandas"
"lat": [45.5, 40.1], --8<-- "python/python/tests/docs/test_guide_tables.py:create_table_async_from_pandas"
"long": [-122.7, -74.1]
})
db.create_table("my_table", data)
db["my_table"].head()
``` ```
!!! info "Note" !!! info "Note"
@@ -165,14 +185,17 @@ db["my_table"].head()
The **`vector`** column needs to be a [Vector](../python/pydantic.md#vector-field) (defined as [pyarrow.FixedSizeList](https://arrow.apache.org/docs/python/generated/pyarrow.list_.html)) type. The **`vector`** column needs to be a [Vector](../python/pydantic.md#vector-field) (defined as [pyarrow.FixedSizeList](https://arrow.apache.org/docs/python/generated/pyarrow.list_.html)) type.
```python === "Sync API"
custom_schema = pa.schema([
pa.field("vector", pa.list_(pa.float32(), 4)),
pa.field("lat", pa.float32()),
pa.field("long", pa.float32())
])
table = db.create_table("my_table", data, schema=custom_schema) ```python
--8<-- "python/python/tests/docs/test_guide_tables.py:import-pyarrow"
--8<-- "python/python/tests/docs/test_guide_tables.py:create_table_custom_schema"
```
=== "Async API"
```python
--8<-- "python/python/tests/docs/test_guide_tables.py:import-pyarrow"
--8<-- "python/python/tests/docs/test_guide_tables.py:create_table_async_custom_schema"
``` ```
### From a Polars DataFrame ### From a Polars DataFrame
@@ -182,15 +205,17 @@ written in Rust. Just like in Pandas, the Polars integration is enabled by PyArr
under the hood. A deeper integration between LanceDB Tables and Polars DataFrames under the hood. A deeper integration between LanceDB Tables and Polars DataFrames
is on the way. is on the way.
```python === "Sync API"
import polars as pl
data = pl.DataFrame({ ```python
"vector": [[3.1, 4.1], [5.9, 26.5]], --8<-- "python/python/tests/docs/test_guide_tables.py:import-polars"
"item": ["foo", "bar"], --8<-- "python/python/tests/docs/test_guide_tables.py:create_table_from_polars"
"price": [10.0, 20.0] ```
}) === "Async API"
table = db.create_table("pl_table", data=data)
```python
--8<-- "python/python/tests/docs/test_guide_tables.py:import-polars"
--8<-- "python/python/tests/docs/test_guide_tables.py:create_table_async_from_polars"
``` ```
### From an Arrow Table ### From an Arrow Table
@@ -198,28 +223,19 @@ You can also create LanceDB tables directly from Arrow tables.
LanceDB supports float16 data type! LanceDB supports float16 data type!
=== "Python" === "Python"
=== "Sync API"
```python ```python
import pyarrow as pa --8<-- "python/python/tests/docs/test_guide_tables.py:import-pyarrow"
import numpy as np --8<-- "python/python/tests/docs/test_guide_tables.py:import-numpy"
--8<-- "python/python/tests/docs/test_guide_tables.py:create_table_from_arrow_table"
```
=== "Async API"
dim = 16 ```python
total = 2 --8<-- "python/python/tests/docs/test_guide_tables.py:import-polars"
schema = pa.schema( --8<-- "python/python/tests/docs/test_guide_tables.py:import-numpy"
[ --8<-- "python/python/tests/docs/test_guide_tables.py:create_table_async_from_arrow_table"
pa.field("vector", pa.list_(pa.float16(), dim)),
pa.field("text", pa.string())
]
)
data = pa.Table.from_arrays(
[
pa.array([np.random.randn(dim).astype(np.float16) for _ in range(total)],
pa.list_(pa.float16(), dim)),
pa.array(["foo", "bar"])
],
["vector", "text"],
)
tbl = db.create_table("f16_tbl", data, schema=schema)
``` ```
=== "Typescript[^1]" === "Typescript[^1]"
@@ -250,24 +266,21 @@ can be configured with the vector dimensions. It is also important to note that
LanceDB only understands subclasses of `lancedb.pydantic.LanceModel` LanceDB only understands subclasses of `lancedb.pydantic.LanceModel`
(which itself derives from `pydantic.BaseModel`). (which itself derives from `pydantic.BaseModel`).
=== "Sync API"
```python ```python
from lancedb.pydantic import Vector, LanceModel --8<-- "python/python/tests/docs/test_guide_tables.py:import-lancedb-pydantic"
--8<-- "python/python/tests/docs/test_guide_tables.py:import-pyarrow"
--8<-- "python/python/tests/docs/test_guide_tables.py:class-Content"
--8<-- "python/python/tests/docs/test_guide_tables.py:create_table_from_pydantic"
```
=== "Async API"
class Content(LanceModel): ```python
movie_id: int --8<-- "python/python/tests/docs/test_guide_tables.py:import-lancedb-pydantic"
vector: Vector(128) --8<-- "python/python/tests/docs/test_guide_tables.py:import-pyarrow"
genres: str --8<-- "python/python/tests/docs/test_guide_tables.py:class-Content"
title: str --8<-- "python/python/tests/docs/test_guide_tables.py:create_table_async_from_pydantic"
imdb_id: int
@property
def imdb_url(self) -> str:
return f"https://www.imdb.com/title/tt{self.imdb_id}"
import pyarrow as pa
db = lancedb.connect("~/.lancedb")
table_name = "movielens_small"
table = db.create_table(table_name, schema=Content)
``` ```
#### Nested schemas #### Nested schemas
@@ -277,22 +290,24 @@ For example, you may want to store the document string
and the document source name as a nested Document object: and the document source name as a nested Document object:
```python ```python
class Document(BaseModel): --8<-- "python/python/tests/docs/test_guide_tables.py:import-pydantic-basemodel"
content: str --8<-- "python/python/tests/docs/test_guide_tables.py:class-Document"
source: str
``` ```
This can be used as the type of a LanceDB table column: This can be used as the type of a LanceDB table column:
=== "Sync API"
```python ```python
class NestedSchema(LanceModel): --8<-- "python/python/tests/docs/test_guide_tables.py:class-NestedSchema"
id: str --8<-- "python/python/tests/docs/test_guide_tables.py:create_table_nested_schema"
vector: Vector(1536)
document: Document
tbl = db.create_table("nested_table", schema=NestedSchema, mode="overwrite")
``` ```
=== "Async API"
```python
--8<-- "python/python/tests/docs/test_guide_tables.py:class-NestedSchema"
--8<-- "python/python/tests/docs/test_guide_tables.py:create_table_async_nested_schema"
```
This creates a struct column called "document" that has two subfields This creates a struct column called "document" that has two subfields
called "content" and "source": called "content" and "source":
@@ -356,28 +371,19 @@ LanceDB additionally supports PyArrow's `RecordBatch` Iterators or other generat
Here's an example using a `RecordBatch` iterator to create a table. Here's an example using a `RecordBatch` iterator to create a table.
=== "Sync API"
```python ```python
import pyarrow as pa --8<-- "python/python/tests/docs/test_guide_tables.py:import-pyarrow"
--8<-- "python/python/tests/docs/test_guide_tables.py:make_batches"
--8<-- "python/python/tests/docs/test_guide_tables.py:create_table_from_batch"
```
=== "Async API"
def make_batches(): ```python
for i in range(5): --8<-- "python/python/tests/docs/test_guide_tables.py:import-pyarrow"
yield pa.RecordBatch.from_arrays( --8<-- "python/python/tests/docs/test_guide_tables.py:make_batches"
[ --8<-- "python/python/tests/docs/test_guide_tables.py:create_table_async_from_batch"
pa.array([[3.1, 4.1, 5.1, 6.1], [5.9, 26.5, 4.7, 32.8]],
pa.list_(pa.float32(), 4)),
pa.array(["foo", "bar"]),
pa.array([10.0, 20.0]),
],
["vector", "item", "price"],
)
schema = pa.schema([
pa.field("vector", pa.list_(pa.float32(), 4)),
pa.field("item", pa.utf8()),
pa.field("price", pa.float32()),
])
db.create_table("batched_tale", make_batches(), schema=schema)
``` ```
You can also use iterators of other types like Pandas DataFrame or Pylists directly in the above example. You can also use iterators of other types like Pandas DataFrame or Pylists directly in the above example.
@@ -387,14 +393,28 @@ You can also use iterators of other types like Pandas DataFrame or Pylists direc
=== "Python" === "Python"
If you forget the name of your table, you can always get a listing of all table names. If you forget the name of your table, you can always get a listing of all table names.
=== "Sync API"
```python ```python
print(db.table_names()) --8<-- "python/python/tests/docs/test_guide_tables.py:list_tables"
```
=== "Async API"
```python
--8<-- "python/python/tests/docs/test_guide_tables.py:list_tables_async"
``` ```
Then, you can open any existing tables. Then, you can open any existing tables.
=== "Sync API"
```python ```python
tbl = db.open_table("my_table") --8<-- "python/python/tests/docs/test_guide_tables.py:open_table"
```
=== "Async API"
```python
--8<-- "python/python/tests/docs/test_guide_tables.py:open_table_async"
``` ```
=== "Typescript[^1]" === "Typescript[^1]"
@@ -418,34 +438,40 @@ You can create an empty table for scenarios where you want to add data to the ta
An empty table can be initialized via a PyArrow schema. An empty table can be initialized via a PyArrow schema.
=== "Sync API"
```python ```python
import lancedb --8<-- "python/python/tests/docs/test_guide_tables.py:import-lancedb"
import pyarrow as pa --8<-- "python/python/tests/docs/test_guide_tables.py:import-pyarrow"
--8<-- "python/python/tests/docs/test_guide_tables.py:create_empty_table"
```
=== "Async API"
schema = pa.schema( ```python
[ --8<-- "python/python/tests/docs/test_guide_tables.py:import-lancedb"
pa.field("vector", pa.list_(pa.float32(), 2)), --8<-- "python/python/tests/docs/test_guide_tables.py:import-pyarrow"
pa.field("item", pa.string()), --8<-- "python/python/tests/docs/test_guide_tables.py:create_empty_table_async"
pa.field("price", pa.float32()),
])
tbl = db.create_table("empty_table_add", schema=schema)
``` ```
Alternatively, you can also use Pydantic to specify the schema for the empty table. Note that we do not Alternatively, you can also use Pydantic to specify the schema for the empty table. Note that we do not
directly import `pydantic` but instead use `lancedb.pydantic.LanceModel`, a subclass of `pydantic.BaseModel` directly import `pydantic` but instead use `lancedb.pydantic.LanceModel`, a subclass of `pydantic.BaseModel`
that has been extended to support LanceDB specific types like `Vector`. that has been extended to support LanceDB specific types like `Vector`.
=== "Sync API"
```python ```python
import lancedb --8<-- "python/python/tests/docs/test_guide_tables.py:import-lancedb"
from lancedb.pydantic import LanceModel, Vector --8<-- "python/python/tests/docs/test_guide_tables.py:import-lancedb-pydantic"
--8<-- "python/python/tests/docs/test_guide_tables.py:class-Item"
--8<-- "python/python/tests/docs/test_guide_tables.py:create_empty_table_pydantic"
```
=== "Async API"
class Item(LanceModel): ```python
vector: Vector(2) --8<-- "python/python/tests/docs/test_guide_tables.py:import-lancedb"
item: str --8<-- "python/python/tests/docs/test_guide_tables.py:import-lancedb-pydantic"
price: float --8<-- "python/python/tests/docs/test_guide_tables.py:class-Item"
--8<-- "python/python/tests/docs/test_guide_tables.py:create_empty_table_async_pydantic"
tbl = db.create_table("empty_table_add", schema=Item.to_arrow_schema())
``` ```
Once the empty table has been created, you can add data to it via the various methods listed in the [Adding to a table](#adding-to-a-table) section. Once the empty table has been created, you can add data to it via the various methods listed in the [Adding to a table](#adding-to-a-table) section.
@@ -473,85 +499,95 @@ After a table has been created, you can always add more data to it using the `ad
### Add a Pandas DataFrame ### Add a Pandas DataFrame
=== "Sync API"
```python ```python
df = pd.DataFrame({ --8<-- "python/python/tests/docs/test_guide_tables.py:add_table_from_pandas"
"vector": [[1.3, 1.4], [9.5, 56.2]], "item": ["banana", "apple"], "price": [5.0, 7.0] ```
}) === "Async API"
tbl.add(df)
```python
--8<-- "python/python/tests/docs/test_guide_tables.py:add_table_async_from_pandas"
``` ```
### Add a Polars DataFrame ### Add a Polars DataFrame
=== "Sync API"
```python ```python
df = pl.DataFrame({ --8<-- "python/python/tests/docs/test_guide_tables.py:add_table_from_polars"
"vector": [[1.3, 1.4], [9.5, 56.2]], "item": ["banana", "apple"], "price": [5.0, 7.0] ```
}) === "Async API"
tbl.add(df)
```python
--8<-- "python/python/tests/docs/test_guide_tables.py:add_table_async_from_polars"
``` ```
### Add an Iterator ### Add an Iterator
You can also add a large dataset in batches using an iterator of any supported data type. You can also add a large dataset in batches using an iterator of any supported data type.
=== "Sync API"
```python ```python
def make_batches(): --8<-- "python/python/tests/docs/test_guide_tables.py:make_batches_for_add"
for i in range(5): --8<-- "python/python/tests/docs/test_guide_tables.py:add_table_from_batch"
yield [ ```
{"vector": [3.1, 4.1], "item": "peach", "price": 6.0}, === "Async API"
{"vector": [5.9, 26.5], "item": "pear", "price": 5.0}
] ```python
tbl.add(make_batches()) --8<-- "python/python/tests/docs/test_guide_tables.py:make_batches_for_add"
--8<-- "python/python/tests/docs/test_guide_tables.py:add_table_async_from_batch"
``` ```
### Add a PyArrow table ### Add a PyArrow table
If you have data coming in as a PyArrow table, you can add it directly to the LanceDB table. If you have data coming in as a PyArrow table, you can add it directly to the LanceDB table.
```python === "Sync API"
pa_table = pa.Table.from_arrays(
[
pa.array([[9.1, 6.7], [9.9, 31.2]],
pa.list_(pa.float32(), 2)),
pa.array(["mango", "orange"]),
pa.array([7.0, 4.0]),
],
["vector", "item", "price"],
)
tbl.add(pa_table) ```python
--8<-- "python/python/tests/docs/test_guide_tables.py:add_table_from_pyarrow"
```
=== "Async API"
```python
--8<-- "python/python/tests/docs/test_guide_tables.py:add_table_async_from_pyarrow"
``` ```
### Add a Pydantic Model ### Add a Pydantic Model
Assuming that a table has been created with the correct schema as shown [above](#creating-empty-table), you can add data items that are valid Pydantic models to the table. Assuming that a table has been created with the correct schema as shown [above](#creating-empty-table), you can add data items that are valid Pydantic models to the table.
```python === "Sync API"
pydantic_model_items = [
Item(vector=[8.1, 4.7], item="pineapple", price=10.0),
Item(vector=[6.9, 9.3], item="avocado", price=9.0)
]
tbl.add(pydantic_model_items) ```python
--8<-- "python/python/tests/docs/test_guide_tables.py:add_table_from_pydantic"
```
=== "Async API"
```python
--8<-- "python/python/tests/docs/test_guide_tables.py:add_table_async_from_pydantic"
``` ```
??? "Ingesting Pydantic models with LanceDB embedding API" ??? "Ingesting Pydantic models with LanceDB embedding API"
When using LanceDB's embedding API, you can add Pydantic models directly to the table. LanceDB will automatically convert the `vector` field to a vector before adding it to the table. You need to specify the default value of `vector` field as None to allow LanceDB to automatically vectorize the data. When using LanceDB's embedding API, you can add Pydantic models directly to the table. LanceDB will automatically convert the `vector` field to a vector before adding it to the table. You need to specify the default value of `vector` field as None to allow LanceDB to automatically vectorize the data.
=== "Sync API"
```python ```python
import lancedb --8<-- "python/python/tests/docs/test_guide_tables.py:import-lancedb"
from lancedb.pydantic import LanceModel, Vector --8<-- "python/python/tests/docs/test_guide_tables.py:import-lancedb-pydantic"
from lancedb.embeddings import get_registry --8<-- "python/python/tests/docs/test_guide_tables.py:import-embeddings"
--8<-- "python/python/tests/docs/test_guide_tables.py:create_table_with_embedding"
```
=== "Async API"
db = lancedb.connect("~/tmp") ```python
embed_fcn = get_registry().get("huggingface").create(name="BAAI/bge-small-en-v1.5") --8<-- "python/python/tests/docs/test_guide_tables.py:import-lancedb"
--8<-- "python/python/tests/docs/test_guide_tables.py:import-lancedb-pydantic"
class Schema(LanceModel): --8<-- "python/python/tests/docs/test_guide_tables.py:import-embeddings"
text: str = embed_fcn.SourceField() --8<-- "python/python/tests/docs/test_guide_tables.py:create_table_async_with_embedding"
vector: Vector(embed_fcn.ndims()) = embed_fcn.VectorField(default=None)
tbl = db.create_table("my_table", schema=Schema, mode="overwrite")
models = [Schema(text="hello"), Schema(text="world")]
tbl.add(models)
``` ```
=== "Typescript[^1]" === "Typescript[^1]"
@@ -565,49 +601,78 @@ After a table has been created, you can always add more data to it using the `ad
) )
``` ```
## Upserting into a table
Upserting lets you insert new rows or update existing rows in a table. To upsert
in LanceDB, use the merge insert API.
=== "Python"
=== "Sync API"
```python
--8<-- "python/python/tests/docs/test_merge_insert.py:upsert_basic"
```
**API Reference**: [lancedb.table.Table.merge_insert][]
=== "Async API"
```python
--8<-- "python/python/tests/docs/test_merge_insert.py:upsert_basic_async"
```
**API Reference**: [lancedb.table.AsyncTable.merge_insert][]
=== "Typescript[^1]"
=== "@lancedb/lancedb"
```typescript
--8<-- "nodejs/examples/merge_insert.test.ts:upsert_basic"
```
**API Reference**: [lancedb.Table.mergeInsert](../js/classes/Table.md/#mergeInsert)
Read more in the guide on [merge insert](tables/merge_insert.md).
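As a rough illustration of what "insert new rows or update existing rows" means, here is a plain-Python sketch of upsert semantics over a list of dicts. It does not call LanceDB. The `upsert` helper and the `id` key column are hypothetical, and a matched row is assumed to be replaced wholesale by the incoming row.

```python
def upsert(rows, new_rows, key):
    """Return rows with new_rows merged in, matching on `key`."""
    # Index existing rows by key; dicts preserve insertion order.
    merged = {row[key]: dict(row) for row in rows}
    for row in new_rows:
        # Matched key: the existing row is replaced.
        # Unmatched key: the row is inserted at the end.
        merged[row[key]] = dict(row)
    return list(merged.values())


existing = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
incoming = [{"id": 2, "name": "B"}, {"id": 3, "name": "c"}]
result = upsert(existing, incoming, key="id")
print(result)
```

In this sketch row `2` is updated in place and row `3` is appended, which are the two outcomes an upsert distinguishes.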
## Deleting from a table ## Deleting from a table
Use the `delete()` method on tables to delete rows from a table. To choose which rows to delete, provide a filter that matches on the metadata columns. This can delete any number of rows that match the filter. Use the `delete()` method on tables to delete rows from a table. To choose which rows to delete, provide a filter that matches on the metadata columns. This can delete any number of rows that match the filter.
=== "Python" === "Python"
=== "Sync API"
```python ```python
tbl.delete('item = "fizz"') --8<-- "python/python/tests/docs/test_guide_tables.py:delete_row"
```
=== "Async API"
```python
--8<-- "python/python/tests/docs/test_guide_tables.py:delete_row_async"
``` ```
### Deleting row with specific column value ### Deleting row with specific column value
=== "Sync API"
```python ```python
import lancedb --8<-- "python/python/tests/docs/test_guide_tables.py:delete_specific_row"
```
=== "Async API"
data = [{"x": 1, "vector": [1, 2]}, ```python
{"x": 2, "vector": [3, 4]}, --8<-- "python/python/tests/docs/test_guide_tables.py:delete_specific_row_async"
{"x": 3, "vector": [5, 6]}]
db = lancedb.connect("./.lancedb")
table = db.create_table("my_table", data)
table.to_pandas()
# x vector
# 0 1 [1.0, 2.0]
# 1 2 [3.0, 4.0]
# 2 3 [5.0, 6.0]
table.delete("x = 2")
table.to_pandas()
# x vector
# 0 1 [1.0, 2.0]
# 1 3 [5.0, 6.0]
``` ```
### Delete from a list of values ### Delete from a list of values
=== "Sync API"
```python ```python
to_remove = [1, 5] --8<-- "python/python/tests/docs/test_guide_tables.py:delete_list_values"
to_remove = ", ".join(str(v) for v in to_remove) ```
=== "Async API"
table.delete(f"x IN ({to_remove})") ```python
table.to_pandas() --8<-- "python/python/tests/docs/test_guide_tables.py:delete_list_values_async"
# x vector
# 0 3 [5.0, 6.0]
``` ```
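The removed example built the `IN (...)` filter by joining stringified values. Below is a slightly more defensive sketch of the same idea. The `in_filter` helper is hypothetical, and quoting strings as single-quoted SQL literals with doubled embedded quotes is an assumption about the filter dialect rather than something the source states.

```python
def in_filter(column, values):
    """Build a `column IN (...)` predicate string for a delete filter."""
    values = list(values)
    if not values:
        # `IN ()` is not valid SQL, so refuse to build it.
        raise ValueError("values must not be empty")

    def literal(value):
        if isinstance(value, str):
            # Assumed quoting rule: single quotes, embedded quotes doubled.
            return "'" + value.replace("'", "''") + "'"
        return str(value)

    return f"{column} IN ({', '.join(literal(v) for v in values)})"


predicate = in_filter("x", [1, 5])
print(predicate)
```

The resulting string could then be passed to `table.delete(...)`, as the original snippet did with `f"x IN ({to_remove})"`.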
=== "Typescript[^1]" === "Typescript[^1]"
@@ -659,26 +724,19 @@ This can be used to update zero to all rows depending on how many rows match the
=== "Python" === "Python"
API Reference: [lancedb.table.Table.update][] API Reference: [lancedb.table.Table.update][]
=== "Sync API"
```python ```python
import lancedb --8<-- "python/python/tests/docs/test_guide_tables.py:import-lancedb"
import pandas as pd --8<-- "python/python/tests/docs/test_guide_tables.py:import-pandas"
--8<-- "python/python/tests/docs/test_guide_tables.py:update_table"
```
=== "Async API"
# Create a lancedb connection ```python
db = lancedb.connect("./.lancedb") --8<-- "python/python/tests/docs/test_guide_tables.py:import-lancedb"
--8<-- "python/python/tests/docs/test_guide_tables.py:import-pandas"
# Create a table from a pandas DataFrame --8<-- "python/python/tests/docs/test_guide_tables.py:update_table_async"
data = pd.DataFrame({"x": [1, 2, 3], "vector": [[1, 2], [3, 4], [5, 6]]})
table = db.create_table("my_table", data)
# Update the table where x = 2
table.update(where="x = 2", values={"vector": [10, 10]})
# Get the updated table as a pandas DataFrame
df = table.to_pandas()
# Print the DataFrame
print(df)
``` ```
Output Output
@@ -707,7 +765,10 @@ This can be used to update zero to all rows depending on how many rows match the
]; ];
const tbl = await db.createTable("my_table", data) const tbl = await db.createTable("my_table", data)
await tbl.update({vector: [10, 10]}, { where: "x = 2"}) await tbl.update({
values: { vector: [10, 10] },
where: "x = 2"
});
``` ```
=== "vectordb (deprecated)" === "vectordb (deprecated)"
@@ -726,7 +787,10 @@ This can be used to update zero to all rows depending on how many rows match the
]; ];
const tbl = await db.createTable("my_table", data) const tbl = await db.createTable("my_table", data)
await tbl.update({ where: "x = 2", values: {vector: [10, 10]} }) await tbl.update({
where: "x = 2",
values: { vector: [10, 10] }
});
``` ```
#### Updating using a sql query #### Updating using a sql query
@@ -734,12 +798,15 @@ This can be used to update zero to all rows depending on how many rows match the
The `values` parameter is used to provide the new values for the columns as literal values. You can also use the `values_sql` / `valuesSql` parameter to provide SQL expressions for the new values. For example, you can use `values_sql="x + 1"` to increment the value of the `x` column by 1. The `values` parameter is used to provide the new values for the columns as literal values. You can also use the `values_sql` / `valuesSql` parameter to provide SQL expressions for the new values. For example, you can use `values_sql="x + 1"` to increment the value of the `x` column by 1.
=== "Python" === "Python"
=== "Sync API"
```python ```python
# Update the table where x = 2 --8<-- "python/python/tests/docs/test_guide_tables.py:update_table_sql"
table.update(valuesSql={"x": "x + 1"}) ```
=== "Async API"
print(table.to_pandas()) ```python
--8<-- "python/python/tests/docs/test_guide_tables.py:update_table_sql_async"
``` ```
Output Output
@@ -771,9 +838,14 @@ This can be used to update zero to all rows depending on how many rows match the
Use the `drop_table()` method on the database to remove a table. Use the `drop_table()` method on the database to remove a table.
=== "Python" === "Python"
=== "Sync API"
```python ```python
--8<-- "python/python/tests/docs/test_basic.py:drop_table" --8<-- "python/python/tests/docs/test_basic.py:drop_table"
```
=== "Async API"
```python
--8<-- "python/python/tests/docs/test_basic.py:drop_table_async" --8<-- "python/python/tests/docs/test_basic.py:drop_table_async"
``` ```
@@ -809,9 +881,16 @@ data type for it.
=== "Python" === "Python"
=== "Sync API"
```python ```python
--8<-- "python/python/tests/docs/test_basic.py:add_columns" --8<-- "python/python/tests/docs/test_basic.py:add_columns"
``` ```
=== "Async API"
```python
--8<-- "python/python/tests/docs/test_basic.py:add_columns_async"
```
**API Reference:** [lancedb.table.Table.add_columns][] **API Reference:** [lancedb.table.Table.add_columns][]
=== "Typescript" === "Typescript"
@@ -848,10 +927,18 @@ rewriting the column, which can be a heavy operation.
=== "Python" === "Python"
=== "Sync API"
```python ```python
import pyarrow as pa --8<-- "python/python/tests/docs/test_guide_tables.py:import-pyarrow"
--8<-- "python/python/tests/docs/test_basic.py:alter_columns" --8<-- "python/python/tests/docs/test_basic.py:alter_columns"
``` ```
=== "Async API"
```python
--8<-- "python/python/tests/docs/test_guide_tables.py:import-pyarrow"
--8<-- "python/python/tests/docs/test_basic.py:alter_columns_async"
```
**API Reference:** [lancedb.table.Table.alter_columns][] **API Reference:** [lancedb.table.Table.alter_columns][]
=== "Typescript" === "Typescript"
@@ -872,9 +959,16 @@ will remove the column from the schema.
=== "Python" === "Python"
=== "Sync API"
```python ```python
--8<-- "python/python/tests/docs/test_basic.py:drop_columns" --8<-- "python/python/tests/docs/test_basic.py:drop_columns"
``` ```
=== "Async API"
```python
--8<-- "python/python/tests/docs/test_basic.py:drop_columns_async"
```
**API Reference:** [lancedb.table.Table.drop_columns][] **API Reference:** [lancedb.table.Table.drop_columns][]
=== "Typescript" === "Typescript"
@@ -925,30 +1019,45 @@ There are three possible settings for `read_consistency_interval`:
To set strong consistency, use `timedelta(0)`: To set strong consistency, use `timedelta(0)`:
=== "Sync API"
```python ```python
from datetime import timedelta --8<-- "python/python/tests/docs/test_guide_tables.py:import-datetime"
db = lancedb.connect("./.lancedb",. read_consistency_interval=timedelta(0)) --8<-- "python/python/tests/docs/test_guide_tables.py:table_strong_consistency"
table = db.open_table("my_table") ```
=== "Async API"
```python
--8<-- "python/python/tests/docs/test_guide_tables.py:import-datetime"
--8<-- "python/python/tests/docs/test_guide_tables.py:table_async_strong_consistency"
``` ```
For eventual consistency, use a custom `timedelta`: For eventual consistency, use a custom `timedelta`:
=== "Sync API"
```python ```python
from datetime import timedelta --8<-- "python/python/tests/docs/test_guide_tables.py:import-datetime"
db = lancedb.connect("./.lancedb", read_consistency_interval=timedelta(seconds=5)) --8<-- "python/python/tests/docs/test_guide_tables.py:table_eventual_consistency"
table = db.open_table("my_table") ```
=== "Async API"
```python
--8<-- "python/python/tests/docs/test_guide_tables.py:import-datetime"
--8<-- "python/python/tests/docs/test_guide_tables.py:table_async_eventual_consistency"
``` ```
By default, a `Table` will never check for updates from other writers. To manually check for updates you can use `checkout_latest`: By default, a `Table` will never check for updates from other writers. To manually check for updates you can use `checkout_latest`:
=== "Sync API"
```python ```python
db = lancedb.connect("./.lancedb") --8<-- "python/python/tests/docs/test_guide_tables.py:table_checkout_latest"
table = db.open_table("my_table") ```
=== "Async API"
# (Other writes happen to my_table from another process) ```python
--8<-- "python/python/tests/docs/test_guide_tables.py:table_async_checkout_latest"
# Check for updates
table.checkout_latest()
``` ```
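The three settings described above can be pictured with a small plain-Python model. This is only a sketch of the behaviour the text describes (never check, check on every read, check at most once per interval); the `ConsistencyModel` class and its `should_refresh` method are hypothetical names used for illustration and are not part of the LanceDB API, and the real refresh policy may differ in detail.

```python
from datetime import datetime, timedelta
from typing import Optional


class ConsistencyModel:
    """Toy model of when a table handle looks for newer versions."""

    def __init__(self, interval: Optional[timedelta], opened_at: datetime):
        self.interval = interval      # None, timedelta(0), or a custom interval
        self.last_check = opened_at   # time of the most recent check

    def should_refresh(self, now: datetime) -> bool:
        # No interval configured: never check unless asked explicitly.
        if self.interval is None:
            return False
        # Interval elapsed (always true for timedelta(0)): check again.
        if now - self.last_check >= self.interval:
            self.last_check = now
            return True
        return False


start = datetime(2025, 1, 1)
never = ConsistencyModel(None, start)
strong = ConsistencyModel(timedelta(0), start)
eventual = ConsistencyModel(timedelta(seconds=5), start)

after_2s = [m.should_refresh(start + timedelta(seconds=2)) for m in (never, strong, eventual)]
after_6s = [m.should_refresh(start + timedelta(seconds=6)) for m in (never, strong, eventual)]
```

In this model, a handle with no interval only sees new data after an explicit call such as `checkout_latest`, which matches the default behaviour stated above.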
=== "Typescript[^1]" === "Typescript[^1]"
@@ -957,14 +1066,14 @@ There are three possible settings for `read_consistency_interval`:
```ts ```ts
const db = await lancedb.connect({ uri: "./.lancedb", readConsistencyInterval: 0 }); const db = await lancedb.connect({ uri: "./.lancedb", readConsistencyInterval: 0 });
const table = await db.openTable("my_table"); const tbl = await db.openTable("my_table");
``` ```
For eventual consistency, specify the update interval as seconds: For eventual consistency, specify the update interval as seconds:
```ts ```ts
const db = await lancedb.connect({ uri: "./.lancedb", readConsistencyInterval: 5 }); const db = await lancedb.connect({ uri: "./.lancedb", readConsistencyInterval: 5 });
const table = await db.openTable("my_table"); const tbl = await db.openTable("my_table");
``` ```
<!-- Node doesn't yet support the version time travel: https://github.com/lancedb/lancedb/issues/1007 <!-- Node doesn't yet support the version time travel: https://github.com/lancedb/lancedb/issues/1007


@@ -0,0 +1,135 @@
The merge insert command is a flexible API that can be used to perform:
1. Upsert
2. Insert-if-not-exists
3. Replace range
It works by joining the input data with the target table on a key you provide.
Often this key is a unique row id key. You can then specify what to do when
there is a match and when there is not a match. For example, for upsert you want
to update if the row has a match and insert if the row doesn't have a match.
Whereas for insert-if-not-exists you only want to insert if the row doesn't have
a match.
You can also read more in the API reference:
* Python
* Sync: [lancedb.table.Table.merge_insert][]
* Async: [lancedb.table.AsyncTable.merge_insert][]
* Typescript: [lancedb.Table.mergeInsert](../../js/classes/Table.md/#mergeinsert)
!!! tip "Use scalar indices to speed up merge insert"
The merge insert command needs to perform a join between the input data and the
target table on the `on` key you provide. This requires scanning that entire
column, which can be expensive for large tables. To speed up this operation,
you can create a scalar index on the `on` column, which will allow LanceDB to
find matches without having to scan the whole table.
Read more about scalar indices in the [Building a Scalar Index](../scalar_index.md)
guide.
!!! info "Embedding Functions"
Like the create table and add APIs, the merge insert API will automatically
compute embeddings if the table has an embedding definition in its schema.
If the input data doesn't contain the source column, or the vector column
is already filled, then the embeddings won't be computed. See the
[Embedding Functions](../../embeddings/embedding_functions.md) guide for more
information.
## Upsert
Upsert updates rows if they exist and inserts them if they don't. To do this
with merge insert, enable both `when_matched_update_all()` and
`when_not_matched_insert_all()`.
=== "Python"
=== "Sync API"
```python
--8<-- "python/python/tests/docs/test_merge_insert.py:upsert_basic"
```
=== "Async API"
```python
--8<-- "python/python/tests/docs/test_merge_insert.py:upsert_basic_async"
```
=== "Typescript"
=== "@lancedb/lancedb"
```typescript
--8<-- "nodejs/examples/merge_insert.test.ts:upsert_basic"
```
!!! note "Providing subsets of columns"
If a column is nullable, it can be omitted from input data and it will be
considered `null`. Columns can also be provided in any order.
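To make the matching rule concrete, here is a minimal plain-Python model of upsert semantics. It only illustrates the join-on-key behaviour described above and is not how LanceDB implements it; the `upsert` helper and the sample rows are hypothetical.

```python
def upsert(target, source, on):
    """Model of upsert: replace rows whose key matches, insert the rest.

    `target` and `source` are lists of dicts; `on` is the key column.
    """
    merged = {row[on]: dict(row) for row in target}
    for row in source:
        key = row[on]
        if key in merged:
            merged[key] = dict(row)   # matched: update the existing row
        else:
            merged[key] = dict(row)   # not matched: insert a new row
    return list(merged.values())


table = [{"id": 0, "name": "Alice"}, {"id": 1, "name": "Bob"}]
new_rows = [{"id": 1, "name": "Bobby"}, {"id": 2, "name": "Charlie"}]
result = upsert(table, new_rows, on="id")
```

Both branches write the incoming row here; they are kept separate only to mirror the two clauses, `when_matched_update_all()` and `when_not_matched_insert_all()`.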
## Insert-if-not-exists
To avoid inserting duplicate rows, you can use the insert-if-not-exists command.
This will only insert rows that do not have a match in the target table. To do
this with merge insert, enable just `when_not_matched_insert_all()`.
=== "Python"
=== "Sync API"
```python
--8<-- "python/python/tests/docs/test_merge_insert.py:insert_if_not_exists"
```
=== "Async API"
```python
--8<-- "python/python/tests/docs/test_merge_insert.py:insert_if_not_exists_async"
```
=== "Typescript"
=== "@lancedb/lancedb"
```typescript
--8<-- "nodejs/examples/merge_insert.test.ts:insert_if_not_exists"
```
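As a rough illustration of the rule above, the following plain-Python sketch keeps every existing row and appends only the input rows whose key is not already present. The `insert_if_not_exists` helper and the sample data are hypothetical and only model the semantics; they are not LanceDB code.

```python
def insert_if_not_exists(target, source, on):
    """Model of insert-if-not-exists: existing rows are never modified."""
    existing_keys = {row[on] for row in target}
    result = [dict(row) for row in target]
    for row in source:
        if row[on] not in existing_keys:
            result.append(dict(row))  # not matched: insert
        # matched rows are ignored because no "when matched" clause is enabled
    return result


table = [
    {"domain": "google.com", "name": "Google"},
    {"domain": "github.com", "name": "GitHub"},
]
new_rows = [
    {"domain": "google.com", "name": "Google (duplicate)"},
    {"domain": "twitter.com", "name": "Twitter"},
]
result = insert_if_not_exists(table, new_rows, on="domain")
```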
## Replace range
You can also replace a range of rows in the target table with the input data.
For example, if you have a table of document chunks, where each chunk has
both a `doc_id` and a `chunk_id`, you can replace all chunks for a given
`doc_id` with updated chunks. This can be tricky otherwise because if you
try to use upsert when the new data has fewer chunks you will end up with
extra chunks. To avoid this, add another clause to delete any chunks for
the document that are not in the new data, with
`when_not_matched_by_source_delete`.
=== "Python"
=== "Sync API"
```python
--8<-- "python/python/tests/docs/test_merge_insert.py:replace_range"
```
=== "Async API"
```python
--8<-- "python/python/tests/docs/test_merge_insert.py:replace_range_async"
```
=== "Typescript"
=== "@lancedb/lancedb"
```typescript
--8<-- "nodejs/examples/merge_insert.test.ts:replace_range"
```
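The effect of the extra delete clause can be sketched in plain Python. This is an illustrative model under the assumptions stated in the comments, not the LanceDB implementation; the `replace_range` helper, its `in_range` predicate, and the sample chunks are hypothetical.

```python
def replace_range(target, source, on, in_range):
    """Model of replace range.

    `on` is a tuple of key columns; `in_range` selects the target rows that
    may be deleted when the source has no matching row.
    """
    def key(row):
        return tuple(row[col] for col in on)

    source_by_key = {key(row): dict(row) for row in source}
    result = []
    for row in target:
        k = key(row)
        if k in source_by_key:
            result.append(source_by_key.pop(k))  # matched: update
        elif in_range(row):
            continue                             # not matched by source, in range: delete
        else:
            result.append(dict(row))             # outside the range: untouched
    result.extend(source_by_key.values())        # not matched: insert
    return result


chunks = [
    {"doc_id": 1, "chunk_id": 0, "text": "Hello"},
    {"doc_id": 1, "chunk_id": 1, "text": "World"},
    {"doc_id": 2, "chunk_id": 0, "text": "Foo"},
]
new_chunks = [{"doc_id": 1, "chunk_id": 0, "text": "Baz"}]
result = replace_range(
    chunks,
    new_chunks,
    on=("doc_id", "chunk_id"),
    in_range=lambda row: row["doc_id"] == 1,
)
```

A plain upsert would have left the stale chunk `(1, 1)` in place; the range predicate is what removes it while leaving document 2 alone.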


@@ -1,8 +1,8 @@
## Improving retriever performance ## Improving retriever performance
Try it yourself - <a href="https://colab.research.google.com/github/lancedb/lancedb/blob/main/docs/src/notebooks/lancedb_reranking.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"></a><br/> Try it yourself: <a href="https://colab.research.google.com/github/lancedb/lancedb/blob/main/docs/src/notebooks/lancedb_reranking.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"></a><br/>
VectorDBs are used as retreivers in recommender or chatbot-based systems for retrieving relevant data based on user queries. For example, retriever is a critical component of Retrieval Augmented Generation (RAG) architectures. In this section, we will discuss how to improve the performance of retrievers. VectorDBs are used as retrievers in recommender or chatbot-based systems for retrieving relevant data based on user queries. For example, retrievers are a critical component of Retrieval Augmented Generation (RAG) architectures. In this section, we will discuss how to improve the performance of retrievers.
There are several ways to improve the performance of retrievers. Some of the common techniques are: There are several ways to improve the performance of retrievers. Some of the common techniques are:
@@ -19,7 +19,7 @@ Using different embedding models is something that's very specific to the use ca
## The dataset ## The dataset
We'll be using a QA dataset generated using a LLama2 review paper. The dataset contains 221 query, context and answer triplets. The queries and answers are generated using GPT-4 based on a given query. Full script used to generate the dataset can be found on this [repo](https://github.com/lancedb/ragged). It can be downloaded from [here](https://github.com/AyushExel/assets/blob/main/data_qa.csv) We'll be using a QA dataset generated using a LLama2 review paper. The dataset contains 221 query, context and answer triplets. The queries and answers are generated using GPT-4 based on a given query. Full script used to generate the dataset can be found on this [repo](https://github.com/lancedb/ragged). It can be downloaded from [here](https://github.com/AyushExel/assets/blob/main/data_qa.csv).
### Using different query types ### Using different query types
Let's setup the embeddings and the dataset first. We'll use the LanceDB's `huggingface` embeddings integration for this guide. Let's setup the embeddings and the dataset first. We'll use the LanceDB's `huggingface` embeddings integration for this guide.
@@ -45,14 +45,14 @@ table.add(df[["context"]].to_dict(orient="records"))
queries = df["query"].tolist() queries = df["query"].tolist()
``` ```
Now that we have the dataset and embeddings table set up, here's how you can run different query types on the dataset. Now that we have the dataset and embeddings table set up, here's how you can run different query types on the dataset:
* <b> Vector Search: </b> * <b> Vector Search: </b>
```python ```python
table.search(queries[0], query_type="vector").limit(5).to_pandas() table.search(queries[0], query_type="vector").limit(5).to_pandas()
``` ```
By default, LanceDB uses vector search query type for searching and it automatically converts the input query to a vector before searching when using embedding API. So, the following statement is equivalent to the above statement. By default, LanceDB uses vector search query type for searching and it automatically converts the input query to a vector before searching when using embedding API. So, the following statement is equivalent to the above statement:
```python ```python
table.search(queries[0]).limit(5).to_pandas() table.search(queries[0]).limit(5).to_pandas()
@@ -77,7 +77,7 @@ Now that we have the dataset and embeddings table set up, here's how you can run
* <b> Hybrid Search: </b> * <b> Hybrid Search: </b>
Hybrid search is a combination of vector and full-text search. Here's how you can run a hybrid search query on the dataset. Hybrid search is a combination of vector and full-text search. Here's how you can run a hybrid search query on the dataset:
```python ```python
table.search(queries[0], query_type="hybrid").limit(5).to_pandas() table.search(queries[0], query_type="hybrid").limit(5).to_pandas()
``` ```
@@ -87,7 +87,7 @@ Now that we have the dataset and embeddings table set up, here's how you can run
!!! note "Note" !!! note "Note"
By default, it uses `LinearCombinationReranker` that combines the scores from vector and full-text search using a weighted linear combination. It is the simplest reranker implementation available in LanceDB. You can also use other rerankers like `CrossEncoderReranker` or `CohereReranker` for reranking the results. By default, it uses `LinearCombinationReranker` that combines the scores from vector and full-text search using a weighted linear combination. It is the simplest reranker implementation available in LanceDB. You can also use other rerankers like `CrossEncoderReranker` or `CohereReranker` for reranking the results.
Learn more about rerankers [here](https://lancedb.github.io/lancedb/reranking/) Learn more about rerankers [here](https://lancedb.github.io/lancedb/reranking/).


@@ -1,6 +1,6 @@
Continuing from the previous section, we can now rerank the results using more complex rerankers. Continuing from the previous section, we can now rerank the results using more complex rerankers.
Try it yourself - <a href="https://colab.research.google.com/github/lancedb/lancedb/blob/main/docs/src/notebooks/lancedb_reranking.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"></a><br/> Try it yourself: <a href="https://colab.research.google.com/github/lancedb/lancedb/blob/main/docs/src/notebooks/lancedb_reranking.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"></a><br/>
## Reranking search results ## Reranking search results
You can rerank any search results using a reranker. The syntax for reranking is as follows: You can rerank any search results using a reranker. The syntax for reranking is as follows:
@@ -62,9 +62,6 @@ Let us take a look at the same datasets from the previous sections, using the sa
| Reranked fts | 0.672 | | Reranked fts | 0.672 |
| Hybrid | 0.759 | | Hybrid | 0.759 |
### SQuAD Dataset
### Uber10K sec filing Dataset ### Uber10K sec filing Dataset
| Query Type | Hit-rate@5 | | Query Type | Hit-rate@5 |


@@ -1,5 +1,5 @@
## Finetuning the Embedding Model ## Finetuning the Embedding Model
Try it yourself - <a href="https://colab.research.google.com/github/lancedb/lancedb/blob/main/docs/src/notebooks/embedding_tuner.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"></a><br/> Try it yourself: <a href="https://colab.research.google.com/github/lancedb/lancedb/blob/main/docs/src/notebooks/embedding_tuner.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"></a><br/>
Another way to improve retriever performance is to fine-tune the embedding model itself. Fine-tuning the embedding model can help in learning better representations for the documents and queries in the dataset. This can be particularly useful when the dataset is very different from the pre-trained data used to train the embedding model. Another way to improve retriever performance is to fine-tune the embedding model itself. Fine-tuning the embedding model can help in learning better representations for the documents and queries in the dataset. This can be particularly useful when the dataset is very different from the pre-trained data used to train the embedding model.
@@ -16,7 +16,7 @@ validation_df.to_csv("data_val.csv", index=False)
You can use any tuning API to fine-tune embedding models. In this example, we'll utilise Llama-index as it also comes with utilities for synthetic data generation and training the model. You can use any tuning API to fine-tune embedding models. In this example, we'll utilise Llama-index as it also comes with utilities for synthetic data generation and training the model.
Then parse the dataset as llama-index text nodes and generate synthetic QA pairs from each node. We parse the dataset as llama-index text nodes and generate synthetic QA pairs from each node:
```python ```python
from llama_index.core.node_parser import SentenceSplitter from llama_index.core.node_parser import SentenceSplitter
from llama_index.readers.file import PagedCSVReader from llama_index.readers.file import PagedCSVReader
@@ -43,7 +43,7 @@ val_dataset = generate_qa_embedding_pairs(
) )
``` ```
Now we'll use `SentenceTransformersFinetuneEngine` engine to fine-tune the model. You can also use `sentence-transformers` or `transformers` library to fine-tune the model. Now we'll use `SentenceTransformersFinetuneEngine` engine to fine-tune the model. You can also use `sentence-transformers` or `transformers` library to fine-tune the model:
```python ```python
from llama_index.finetuning import SentenceTransformersFinetuneEngine from llama_index.finetuning import SentenceTransformersFinetuneEngine
@@ -57,7 +57,7 @@ finetune_engine = SentenceTransformersFinetuneEngine(
finetune_engine.finetune() finetune_engine.finetune()
embed_model = finetune_engine.get_finetuned_model() embed_model = finetune_engine.get_finetuned_model()
``` ```
This saves the fine tuned embedding model in `tuned_model` folder. This al This saves the fine tuned embedding model in `tuned_model` folder.
# Evaluation results # Evaluation results
In order to eval the retriever, you can either use this model to ingest the data into LanceDB directly or llama-index's LanceDB integration to create a `VectorStoreIndex` and use it as a retriever. In order to eval the retriever, you can either use this model to ingest the data into LanceDB directly or llama-index's LanceDB integration to create a `VectorStoreIndex` and use it as a retriever.


@@ -3,22 +3,22 @@
Hybrid Search is a broad (often misused) term. It can mean anything from combining multiple methods for searching, to applying ranking methods to better sort the results. In this blog, we use the definition of "hybrid search" to mean using a combination of keyword-based and vector search. Hybrid Search is a broad (often misused) term. It can mean anything from combining multiple methods for searching, to applying ranking methods to better sort the results. In this blog, we use the definition of "hybrid search" to mean using a combination of keyword-based and vector search.
## The challenge of (re)ranking search results ## The challenge of (re)ranking search results
Once you have a group of the most relevant search results from multiple search sources, you'd likely standardize the score and rank them accordingly. This process can also be seen as another independent step-reranking. Once you have a group of the most relevant search results from multiple search sources, you'd likely standardize the score and rank them accordingly. This process can also be seen as another independent step: reranking.
There are two approaches for reranking search results from multiple sources. There are two approaches for reranking search results from multiple sources.
* <b>Score-based</b>: Calculate final relevance scores based on a weighted linear combination of individual search algorithm scores. Example-Weighted linear combination of semantic search & keyword-based search results. * <b>Score-based</b>: Calculate final relevance scores based on a weighted linear combination of individual search algorithm scores. Example: Weighted linear combination of semantic search & keyword-based search results.
* <b>Relevance-based</b>: Discards the existing scores and calculates the relevance of each search result-query pair. Example-Cross Encoder models * <b>Relevance-based</b>: Discards the existing scores and calculates the relevance of each search result-query pair. Example: Cross Encoder models
Even though there are many strategies for reranking search results, none works for all cases. Moreover, evaluating them itself is a challenge. Also, reranking can be dataset, application specific so it's hard to generalize. Even though there are many strategies for reranking search results, none works for all cases. Moreover, evaluating them itself is a challenge. Also, reranking can be dataset or application specific so it's hard to generalize.
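As a rough sketch of the score-based approach, the snippet below normalizes two sets of scores to a common range and blends them with a weight. The function names, the min-max normalization, and the default weight of `0.7` are assumptions chosen for illustration; they are not taken from any particular reranker implementation.

```python
def min_max_normalize(scores):
    """Rescale scores to [0, 1] so different scorers become comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def linear_combination(vector_scores, keyword_scores, weight=0.7):
    """Rank documents by weight * vector score + (1 - weight) * keyword score."""
    v = min_max_normalize(vector_scores)
    k = min_max_normalize(keyword_scores)
    combined = {
        doc: weight * v.get(doc, 0.0) + (1 - weight) * k.get(doc, 0.0)
        for doc in set(v) | set(k)
    }
    return sorted(combined, key=combined.get, reverse=True)


vector_scores = {"a": 0.9, "b": 0.5, "c": 0.1}      # higher is better
keyword_scores = {"b": 12.0, "c": 3.0, "d": 8.0}    # e.g. BM25-style scores
ranking = linear_combination(vector_scores, keyword_scores)
```

Documents missing from one result list receive a score of zero for that source in this sketch, which is one of several possible conventions.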
### Example evaluation of hybrid search with Reranking ### Example evaluation of hybrid search with Reranking
Here's some evaluation numbers from experiment comparing these re-rankers on about 800 queries. It is a modified version of an evaluation script from [llama-index](https://github.com/run-llama/finetune-embedding/blob/main/evaluate.ipynb) that measures hit-rate at top-k. Here's some evaluation numbers from an experiment comparing these rerankers on about 800 queries. It is a modified version of an evaluation script from [llama-index](https://github.com/run-llama/finetune-embedding/blob/main/evaluate.ipynb) that measures hit-rate at top-k.
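For readers unfamiliar with the metric, hit-rate at top-k is commonly computed as the fraction of queries whose expected document appears among the first k results. The sketch below assumes that definition; the `hit_rate_at_k` helper and the sample data are hypothetical, and the linked evaluation script may differ in detail.

```python
def hit_rate_at_k(retrieved, expected, k):
    """Fraction of queries whose expected document is in the top-k results.

    `retrieved` maps query id -> ranked list of document ids;
    `expected` maps query id -> the single relevant document id.
    """
    hits = sum(
        1 for query_id, docs in retrieved.items() if expected[query_id] in docs[:k]
    )
    return hits / len(retrieved)


retrieved = {
    "q1": ["d3", "d1", "d7"],
    "q2": ["d2", "d9", "d4"],
    "q3": ["d5", "d6", "d8"],
    "q4": ["d1", "d2", "d3"],
}
expected = {"q1": "d1", "q2": "d4", "q3": "d0", "q4": "d1"}

top2 = hit_rate_at_k(retrieved, expected, k=2)
top3 = hit_rate_at_k(retrieved, expected, k=3)
```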
<b> With OpenAI ada2 embedding </b> <b> With OpenAI ada2 embedding </b>
Vector Search baseline - `0.64` Vector Search baseline: `0.64`
| Reranker | Top-3 | Top-5 | Top-10 | | Reranker | Top-3 | Top-5 | Top-10 |
| --- | --- | --- | --- | | --- | --- | --- | --- |
@@ -33,7 +33,7 @@ Vector Search baseline - `0.64`
<b> With OpenAI embedding-v3-small </b> <b> With OpenAI embedding-v3-small </b>
Vector Search baseline - `0.59` Vector Search baseline: `0.59`
| Reranker | Top-3 | Top-5 | Top-10 | | Reranker | Top-3 | Top-5 | Top-10 |
| --- | --- | --- | --- | | --- | --- | --- | --- |


@@ -5,56 +5,45 @@ LanceDB supports both semantic and keyword-based search (also termed full-text s
## Hybrid search in LanceDB ## Hybrid search in LanceDB
You can perform hybrid search in LanceDB by combining the results of semantic and full-text search via a reranking algorithm of your choice. LanceDB provides multiple rerankers out of the box. However, you can always write a custom reranker if your use case needs more sophisticated logic. You can perform hybrid search in LanceDB by combining the results of semantic and full-text search via a reranking algorithm of your choice. LanceDB provides multiple rerankers out of the box. However, you can always write a custom reranker if your use case needs more sophisticated logic.
=== "Sync API"
```python ```python
import os --8<-- "python/python/tests/docs/test_search.py:import-os"
--8<-- "python/python/tests/docs/test_search.py:import-openai"
import lancedb --8<-- "python/python/tests/docs/test_search.py:import-lancedb"
import openai --8<-- "python/python/tests/docs/test_search.py:import-embeddings"
from lancedb.embeddings import get_registry --8<-- "python/python/tests/docs/test_search.py:import-pydantic"
from lancedb.pydantic import LanceModel, Vector --8<-- "python/python/tests/docs/test_search.py:import-lancedb-fts"
--8<-- "python/python/tests/docs/test_search.py:import-openai-embeddings"
db = lancedb.connect("~/.lancedb") --8<-- "python/python/tests/docs/test_search.py:class-Documents"
--8<-- "python/python/tests/docs/test_search.py:basic_hybrid_search"
# Ingest embedding function in LanceDB table
# Configuring the environment variable OPENAI_API_KEY
if "OPENAI_API_KEY" not in os.environ:
# OR set the key here as a variable
openai.api_key = "sk-..."
embeddings = get_registry().get("openai").create()
class Documents(LanceModel):
vector: Vector(embeddings.ndims()) = embeddings.VectorField()
text: str = embeddings.SourceField()
table = db.create_table("documents", schema=Documents)
data = [
{ "text": "rebel spaceships striking from a hidden base"},
{ "text": "have won their first victory against the evil Galactic Empire"},
{ "text": "during the battle rebel spies managed to steal secret plans"},
{ "text": "to the Empire's ultimate weapon the Death Star"}
]
# ingest docs with auto-vectorization
table.add(data)
# Create a fts index before the hybrid search
table.create_fts_index("text")
# hybrid search with default re-ranker
results = table.search("flower moon", query_type="hybrid").to_pandas()
``` ```
=== "Async API"
```python
--8<-- "python/python/tests/docs/test_search.py:import-os"
--8<-- "python/python/tests/docs/test_search.py:import-openai"
--8<-- "python/python/tests/docs/test_search.py:import-lancedb"
--8<-- "python/python/tests/docs/test_search.py:import-embeddings"
--8<-- "python/python/tests/docs/test_search.py:import-pydantic"
--8<-- "python/python/tests/docs/test_search.py:import-lancedb-fts"
--8<-- "python/python/tests/docs/test_search.py:import-openai-embeddings"
--8<-- "python/python/tests/docs/test_search.py:class-Documents"
--8<-- "python/python/tests/docs/test_search.py:basic_hybrid_search_async"
```
!!! Note !!! Note
You can also pass the vector and text query manually. This is useful if you're not using the embedding API or if you're using a separate embedder service. You can also pass the vector and text query manually. This is useful if you're not using the embedding API or if you're using a separate embedder service.
### Explicitly passing the vector and text query ### Explicitly passing the vector and text query
```python === "Sync API"
vector_query = [0.1, 0.2, 0.3, 0.4, 0.5]
text_query = "flower moon"
results = table.search(query_type="hybrid")
.vector(vector_query)
.text(text_query)
.limit(5)
.to_pandas()
```python
--8<-- "python/python/tests/docs/test_search.py:hybrid_search_pass_vector_text"
```
=== "Async API"
```python
--8<-- "python/python/tests/docs/test_search.py:hybrid_search_pass_vector_text_async"
``` ```
By default, LanceDB uses `RRFReranker()`, which uses reciprocal rank fusion score, to combine and rerank the results of semantic and full-text search. You can customize the hyperparameters as needed or write your own custom reranker. Here's how you can use any of the available rerankers: By default, LanceDB uses `RRFReranker()`, which uses reciprocal rank fusion score, to combine and rerank the results of semantic and full-text search. You can customize the hyperparameters as needed or write your own custom reranker. Here's how you can use any of the available rerankers:
@@ -68,7 +57,7 @@ By default, LanceDB uses `RRFReranker()`, which uses reciprocal rank fusion scor
## Available Rerankers ## Available Rerankers
LanceDB provides a number of re-rankers out of the box. You can use any of these re-rankers by passing them to the `rerank()` method. LanceDB provides a number of rerankers out of the box. You can use any of these rerankers by passing them to the `rerank()` method.
Go to [Rerankers](../reranking/index.md) to learn more about using the available rerankers and implementing custom rerankers. Go to [Rerankers](../reranking/index.md) to learn more about using the available rerankers and implementing custom rerankers.
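For intuition about the reciprocal rank fusion used by the default reranker, here is a minimal plain-Python sketch of the general technique: each document receives `1 / (k + rank)` from every result list it appears in, and the sums are sorted. The `reciprocal_rank_fusion` helper is hypothetical and `k = 60` is a commonly used constant assumed here; this is not the code of `RRFReranker` itself.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            # Documents near the top of any list contribute larger terms.
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


vector_hits = ["a", "b", "c"]   # ranked results from semantic search
fts_hits = ["b", "d", "a"]      # ranked results from full-text search
fused = reciprocal_rank_fusion([vector_hits, fts_hits])
```

Because only ranks are used, the two searches do not need comparable score scales, which is one reason rank fusion is a convenient default.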


@@ -4,6 +4,9 @@ LanceDB is an open-source vector database for AI that's designed to store, manag
Both the database and the underlying data format are designed from the ground up to be **easy-to-use**, **scalable** and **cost-effective**. Both the database and the underlying data format are designed from the ground up to be **easy-to-use**, **scalable** and **cost-effective**.
!!! tip "Hosted LanceDB"
If you want S3 cost-efficiency and local performance via a simple serverless API, check out **LanceDB Cloud**. For private deployments, high performance at extreme scale, or if you have strict security requirements, talk to us about **LanceDB Enterprise**. [Learn more](https://docs.lancedb.com/)
![](assets/lancedb_and_lance.png) ![](assets/lancedb_and_lance.png)
## Truly multi-modal ## Truly multi-modal
@@ -20,7 +23,7 @@ LanceDB **OSS** is an **open-source**, batteries-included embedded vector databa
LanceDB **Cloud** is a SaaS (software-as-a-service) solution that runs serverless in the cloud, making the storage clearly separated from compute. It's designed to be cost-effective and highly scalable without breaking the bank. LanceDB Cloud is currently in private beta with general availability coming soon, but you can apply for early access with the private beta release by signing up below. LanceDB **Cloud** is a SaaS (software-as-a-service) solution that runs serverless in the cloud, making the storage clearly separated from compute. It's designed to be cost-effective and highly scalable without breaking the bank. LanceDB Cloud is currently in private beta with general availability coming soon, but you can apply for early access with the private beta release by signing up below.
[Try out LanceDB Cloud](https://noteforms.com/forms/lancedb-mailing-list-cloud-kty1o5?notionforms=1&utm_source=notionforms){ .md-button .md-button--primary } [Try out LanceDB Cloud (Public Beta) Now](https://cloud.lancedb.com){ .md-button .md-button--primary }
## Why use LanceDB? ## Why use LanceDB?


@@ -108,7 +108,7 @@ This method creates a scalar(for non-vector cols) or a vector index on a table.
|:---|:---|:---|:---| |:---|:---|:---|:---|
|`vector_col`|`Optional[str]`| Provide if you want to create index on a vector column. |`None`| |`vector_col`|`Optional[str]`| Provide if you want to create index on a vector column. |`None`|
|`col_name`|`Optional[str]`| Provide if you want to create index on a non-vector column. |`None`| |`col_name`|`Optional[str]`| Provide if you want to create index on a non-vector column. |`None`|
|`metric`|`Optional[str]` |Provide the metric to use for vector index. choice of metrics: 'L2', 'dot', 'cosine'. |`L2`| |`metric`|`Optional[str]` |Provide the metric to use for vector index. choice of metrics: 'l2', 'dot', 'cosine'. |`l2`|
|`num_partitions`|`Optional[int]`|Number of partitions to use for the index.|`256`| |`num_partitions`|`Optional[int]`|Number of partitions to use for the index.|`256`|
|`num_sub_vectors`|`Optional[int]` |Number of sub-vectors to use for the index.|`96`| |`num_sub_vectors`|`Optional[int]` |Number of sub-vectors to use for the index.|`96`|
|`index_cache_size`|`Optional[int]` |Size of the index cache.|`None`| |`index_cache_size`|`Optional[int]` |Size of the index cache.|`None`|


@@ -125,7 +125,7 @@ The exhaustive list of parameters for `LanceDBVectorStore` vector store are :
``` ```
- **_table_exists(self, tbl_name: `Optional[str]` = `None`) -> `bool`** : Returns `True` if `tbl_name` exists in database. - **_table_exists(self, tbl_name: `Optional[str]` = `None`) -> `bool`** : Returns `True` if `tbl_name` exists in database.
- __create_index( - __create_index(
self, scalar: `Optional[bool]` = False, col_name: `Optional[str]` = None, num_partitions: `Optional[int]` = 256, num_sub_vectors: `Optional[int]` = 96, index_cache_size: `Optional[int]` = None, metric: `Optional[str]` = "L2", self, scalar: `Optional[bool]` = False, col_name: `Optional[str]` = None, num_partitions: `Optional[int]` = 256, num_sub_vectors: `Optional[int]` = 96, index_cache_size: `Optional[int]` = None, metric: `Optional[str]` = "l2",
) -> `None`__ : Creates a scalar(for non-vector cols) or a vector index on a table. ) -> `None`__ : Creates a scalar(for non-vector cols) or a vector index on a table.
Make sure your vector column has enough data before creating an index on it. Make sure your vector column has enough data before creating an index on it.


@@ -10,7 +10,7 @@ Distance metrics type.
- [Cosine](MetricType.md#cosine) - [Cosine](MetricType.md#cosine)
- [Dot](MetricType.md#dot) - [Dot](MetricType.md#dot)
- [L2](MetricType.md#l2) - [l2](MetricType.md#l2)
## Enumeration Members ## Enumeration Members


@@ -85,7 +85,7 @@ ___
`Optional` **metric\_type**: [`MetricType`](../enums/MetricType.md) `Optional` **metric\_type**: [`MetricType`](../enums/MetricType.md)
Metric type, L2 or Cosine Metric type, l2 or Cosine
#### Defined in #### Defined in


@@ -15,11 +15,9 @@ npm install @lancedb/lancedb
This will download the appropriate native library for your platform. We currently This will download the appropriate native library for your platform. We currently
support: support:
- Linux (x86_64 and aarch64) - Linux (x86_64 and aarch64 on glibc and musl)
- MacOS (Intel and ARM/M1/M2) - MacOS (Intel and ARM/M1/M2)
- Windows (x86_64 only) - Windows (x86_64 and aarch64)
We do not yet support musl-based Linux (such as Alpine Linux) or aarch64 Windows.
## Usage ## Usage
@@ -36,41 +34,8 @@ const results = await table.vectorSearch([0.1, 0.3]).limit(20).toArray();
console.log(results); console.log(results);
``` ```
The [quickstart](../basic.md) contains a more complete example. The [quickstart](https://lancedb.github.io/lancedb/basic/) contains a more complete example.
## Development ## Development
```sh See [CONTRIBUTING.md](_media/CONTRIBUTING.md) for information on how to contribute to LanceDB.
npm run build
npm run test
```
### Running lint / format
LanceDB uses [biome](https://biomejs.dev/) for linting and formatting. If you are using VSCode, you will need to install the official [Biome](https://marketplace.visualstudio.com/items?itemName=biomejs.biome) extension.
To manually lint your code you can run:
```sh
npm run lint
```
to automatically fix all fixable issues:
```sh
npm run lint-fix
```
If you do not have your workspace root set to the `nodejs` directory, unfortunately the extension will not work. You can still run the linting and formatting commands manually.
### Generating docs
```sh
npm run docs
cd ../docs
# Assume the virtual environment was created
# python3 -m venv venv
# pip install -r requirements.txt
. ./venv/bin/activate
mkdocs build
```


@@ -0,0 +1,76 @@
# Contributing to LanceDB Typescript
This document outlines the process for contributing to LanceDB Typescript.
For general contribution guidelines, see [CONTRIBUTING.md](../CONTRIBUTING.md).
## Project layout
The Typescript package is a wrapper around the Rust library, `lancedb`. We use
the [napi-rs](https://napi.rs/) library to create the bindings between Rust and
Typescript.
* `src/`: Rust bindings source code
* `lancedb/`: Typescript package source code
* `__test__/`: Unit tests
* `examples/`: An npm package with the examples shown in the documentation
## Development environment
To set up your development environment, you will need to install the following:
1. Node.js 14 or later
2. Rust's package manager, Cargo. Use [rustup](https://rustup.rs/) to install.
3. [protoc](https://grpc.io/docs/protoc-installation/) (Protocol Buffers compiler)
Initial setup:
```shell
npm install
```
### Commit Hooks
It is **highly recommended** to install the [pre-commit](https://pre-commit.com/) hooks to ensure that your
code is formatted correctly and passes basic checks before committing:
```shell
pre-commit install
```
## Development
Most common development commands can be run using the npm scripts.
Build the package:
```shell
npm install
npm run build
```
Lint:
```shell
npm run lint
```
Format and fix lints:
```shell
npm run lint-fix
```
Run tests:
```shell
npm test
```
To run a single test:
```shell
# Single file: table.test.ts
npm test -- table.test.ts
# Single test: 'merge insert' in table.test.ts
npm test -- table.test.ts --testNamePattern=merge\ insert
```


@@ -0,0 +1,67 @@
[**@lancedb/lancedb**](../README.md) • **Docs**
***
[@lancedb/lancedb](../globals.md) / BoostQuery
# Class: BoostQuery
Represents a full-text query interface.
This interface defines the structure and behavior for full-text queries,
including methods to retrieve the query type and convert the query to a dictionary format.
## Implements
- [`FullTextQuery`](../interfaces/FullTextQuery.md)
## Constructors
### new BoostQuery()
```ts
new BoostQuery(
positive,
negative,
options?): BoostQuery
```
Creates an instance of BoostQuery.
The boost returns documents that match the positive query,
but penalizes those that match the negative query.
The penalty is controlled by the `negativeBoost` parameter.
#### Parameters
* **positive**: [`FullTextQuery`](../interfaces/FullTextQuery.md)
The positive query that boosts the relevance score.
* **negative**: [`FullTextQuery`](../interfaces/FullTextQuery.md)
The negative query that reduces the relevance score.
* **options?**
Optional parameters for the boost query.
- `negativeBoost`: The boost factor for the negative query (default is 0.0).
* **options.negativeBoost?**: `number`
#### Returns
[`BoostQuery`](BoostQuery.md)
## Methods
### queryType()
```ts
queryType(): FullTextQueryType
```
The type of the full-text query.
#### Returns
[`FullTextQueryType`](../enumerations/FullTextQueryType.md)
#### Implementation of
[`FullTextQuery`](../interfaces/FullTextQuery.md).[`queryType`](../interfaces/FullTextQuery.md#querytype)


@@ -23,18 +23,6 @@ be closed when they are garbage collected.
Any created tables are independent and will continue to work even if Any created tables are independent and will continue to work even if
the underlying connection has been closed. the underlying connection has been closed.
## Constructors
### new Connection()
```ts
new Connection(): Connection
```
#### Returns
[`Connection`](Connection.md)
## Methods ## Methods
### close() ### close()
@@ -71,7 +59,7 @@ Creates a new empty Table
* **name**: `string` * **name**: `string`
The name of the table. The name of the table.
* **schema**: `SchemaLike` * **schema**: [`SchemaLike`](../type-aliases/SchemaLike.md)
The schema of the table The schema of the table
* **options?**: `Partial`&lt;[`CreateTableOptions`](../interfaces/CreateTableOptions.md)&gt; * **options?**: `Partial`&lt;[`CreateTableOptions`](../interfaces/CreateTableOptions.md)&gt;
@@ -117,7 +105,7 @@ Creates a new Table and initialize it with new data.
* **name**: `string` * **name**: `string`
The name of the table. The name of the table.
* **data**: `TableLike` \| `Record`&lt;`string`, `unknown`&gt;[] * **data**: [`TableLike`](../type-aliases/TableLike.md) \| `Record`&lt;`string`, `unknown`&gt;[]
Non-empty Array of Records Non-empty Array of Records
to be inserted into the table to be inserted into the table
@@ -143,6 +131,20 @@ Return a brief description of the connection
*** ***
### dropAllTables()
```ts
abstract dropAllTables(): Promise<void>
```
Drop all tables in the database.
#### Returns
`Promise`&lt;`void`&gt;
***
### dropTable() ### dropTable()
```ts ```ts
@@ -189,7 +191,7 @@ Open a table in the database.
* **name**: `string` * **name**: `string`
The name of the table The name of the table
* **options?**: `Partial`&lt;`OpenTableOptions`&gt; * **options?**: `Partial`&lt;[`OpenTableOptions`](../interfaces/OpenTableOptions.md)&gt;
#### Returns #### Returns


@@ -72,11 +72,9 @@ The results of a full text search are ordered by relevance measured by BM25.
You can combine filters with full text search. You can combine filters with full text search.
For now, the full text search index only supports English, and doesn't support phrase search.
#### Parameters #### Parameters
* **options?**: `Partial`&lt;`FtsOptions`&gt; * **options?**: `Partial`&lt;[`FtsOptions`](../interfaces/FtsOptions.md)&gt;
#### Returns #### Returns
@@ -98,7 +96,7 @@ the vectors.
#### Parameters #### Parameters
* **options?**: `Partial`&lt;`HnswPqOptions`&gt; * **options?**: `Partial`&lt;[`HnswPqOptions`](../interfaces/HnswPqOptions.md)&gt;
#### Returns #### Returns
@@ -120,7 +118,38 @@ the vectors.
#### Parameters #### Parameters
* **options?**: `Partial`&lt;`HnswSqOptions`&gt; * **options?**: `Partial`&lt;[`HnswSqOptions`](../interfaces/HnswSqOptions.md)&gt;
#### Returns
[`Index`](Index.md)
***
### ivfFlat()
```ts
static ivfFlat(options?): Index
```
Create an IvfFlat index
This index groups vectors into partitions of similar vectors. Each partition keeps track of
a centroid which is the average value of all vectors in the group.
During a query the centroids are compared with the query vector to find the closest
partitions. The vectors in these partitions are then searched to find
the closest vectors.
The partitioning process is called IVF and the `num_partitions` parameter controls how
many groups to create.
Note that training an IVF FLAT index on a large dataset is a slow operation and
currently is also a memory intensive operation.
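The search procedure described above can be illustrated with a toy, in-memory version. This is only a sketch under stated assumptions: the centroids are given rather than trained, distances are squared L2, and `buildPartitions`, `search`, and the `nprobes` argument are hypothetical names that are not part of the LanceDB API.

```ts
type Vec = number[];

// Squared Euclidean (L2) distance between two vectors of equal length.
function sqDist(a: Vec, b: Vec): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    const d = a[i] - b[i];
    sum += d * d;
  }
  return sum;
}

// Index of the centroid closest to v.
function nearestCentroid(v: Vec, centroids: Vec[]): number {
  let best = 0;
  for (let i = 1; i < centroids.length; i++) {
    if (sqDist(v, centroids[i]) < sqDist(v, centroids[best])) best = i;
  }
  return best;
}

// Assign every vector (by its index) to the partition of its nearest centroid.
function buildPartitions(vectors: Vec[], centroids: Vec[]): number[][] {
  const partitions: number[][] = centroids.map(() => []);
  vectors.forEach((v, idx) => {
    partitions[nearestCentroid(v, centroids)].push(idx);
  });
  return partitions;
}

// Probe the `nprobes` closest partitions, then scan their vectors
// exhaustively (the "flat" part) and return the indices of the k nearest.
function search(
  query: Vec,
  vectors: Vec[],
  centroids: Vec[],
  partitions: number[][],
  nprobes: number,
  k: number,
): number[] {
  const probed = centroids
    .map((c, i) => ({ i, d: sqDist(query, c) }))
    .sort((a, b) => a.d - b.d)
    .slice(0, nprobes);
  const candidates: { idx: number; d: number }[] = [];
  probed.forEach(({ i }) => {
    partitions[i].forEach((idx) => {
      candidates.push({ idx, d: sqDist(query, vectors[idx]) });
    });
  });
  return candidates
    .sort((a, b) => a.d - b.d)
    .slice(0, k)
    .map((c) => c.idx);
}

const vectors: Vec[] = [[0, 0], [0.1, 0.2], [5, 5], [5.2, 4.9]];
const centroids: Vec[] = [[0, 0], [5, 5]];
const partitions = buildPartitions(vectors, centroids);
const hits = search([4.9, 5.1], vectors, centroids, partitions, 1, 2);
console.log(partitions, hits);
```

Probing fewer partitions means fewer distance computations, at the cost of possibly missing neighbors that fell into an unprobed partition.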
#### Parameters
* **options?**: `Partial`&lt;[`IvfFlatOptions`](../interfaces/IvfFlatOptions.md)&gt;
#### Returns #### Returns


@@ -0,0 +1,70 @@
[**@lancedb/lancedb**](../README.md) • **Docs**
***
[@lancedb/lancedb](../globals.md) / MatchQuery
# Class: MatchQuery
Represents a full-text query interface.
This interface defines the structure and behavior for full-text queries,
including methods to retrieve the query type and convert the query to a dictionary format.
## Implements
- [`FullTextQuery`](../interfaces/FullTextQuery.md)
## Constructors
### new MatchQuery()
```ts
new MatchQuery(
query,
column,
options?): MatchQuery
```
Creates an instance of MatchQuery.
#### Parameters
* **query**: `string`
The text query to search for.
* **column**: `string`
The name of the column to search within.
* **options?**
Optional parameters for the match query.
- `boost`: The boost factor for the query (default is 1.0).
- `fuzziness`: The fuzziness level for the query (default is 0).
- `maxExpansions`: The maximum number of terms to consider for fuzzy matching (default is 50).
* **options.boost?**: `number`
* **options.fuzziness?**: `number`
* **options.maxExpansions?**: `number`
#### Returns
[`MatchQuery`](MatchQuery.md)
## Methods
### queryType()
```ts
queryType(): FullTextQueryType
```
The type of the full-text query.
#### Returns
[`FullTextQueryType`](../enumerations/FullTextQueryType.md)
#### Implementation of
[`FullTextQuery`](../interfaces/FullTextQuery.md).[`queryType`](../interfaces/FullTextQuery.md#querytype)
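To give a feel for the `fuzziness` and `maxExpansions` options of the constructor, here is a minimal sketch. It assumes, as in many search engines, that `fuzziness` is a maximum Levenshtein edit distance and that `maxExpansions` caps how many indexed terms a query term may expand to; the text above does not state this, and `editDistance` and `expandTerm` are hypothetical helpers, not LanceDB functions.

```ts
// Classic dynamic-programming Levenshtein edit distance.
function editDistance(a: string, b: string): number {
  let prev: number[] = [];
  for (let j = 0; j <= b.length; j++) prev.push(j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr.push(Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost));
    }
    prev = curr;
  }
  return prev[b.length];
}

// Expand a query term to the indexed terms within `fuzziness` edits,
// keeping at most `maxExpansions` of them (in vocabulary order).
function expandTerm(
  term: string,
  vocabulary: string[],
  fuzziness: number,
  maxExpansions: number,
): string[] {
  return vocabulary
    .filter((candidate) => editDistance(term, candidate) <= fuzziness)
    .slice(0, maxExpansions);
}

const vocabulary = ["puppy", "puppet", "poppy", "kitten"];
const exact = expandTerm("puppy", vocabulary, 0, 50);
const fuzzy = expandTerm("puppy", vocabulary, 1, 50);
const capped = expandTerm("puppy", vocabulary, 2, 2);
console.log(exact, fuzzy, capped);
```

With a fuzziness of 0 only the exact term survives, which is consistent with the documented default behaving as an exact match.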


@@ -0,0 +1,126 @@
[**@lancedb/lancedb**](../README.md) • **Docs**
***
[@lancedb/lancedb](../globals.md) / MergeInsertBuilder
# Class: MergeInsertBuilder
A builder used to create and run a merge insert operation
## Constructors
### new MergeInsertBuilder()
```ts
new MergeInsertBuilder(native, schema): MergeInsertBuilder
```
Construct a MergeInsertBuilder. __Internal use only.__
#### Parameters
* **native**: `NativeMergeInsertBuilder`
* **schema**: `Schema`&lt;`any`&gt; \| `Promise`&lt;`Schema`&lt;`any`&gt;&gt;
#### Returns
[`MergeInsertBuilder`](MergeInsertBuilder.md)
## Methods
### execute()
```ts
execute(data): Promise<MergeStats>
```
Executes the merge insert operation
#### Parameters
* **data**: [`Data`](../type-aliases/Data.md)
#### Returns
`Promise`&lt;[`MergeStats`](../interfaces/MergeStats.md)&gt;
Statistics about the merge operation: counts of inserted, updated, and deleted rows
***
### whenMatchedUpdateAll()
```ts
whenMatchedUpdateAll(options?): MergeInsertBuilder
```
Rows that exist in both the source table (new data) and
the target table (old data) will be updated, replacing
the old row with the corresponding matching row.
If there are multiple matches then the behavior is undefined.
Currently this causes multiple copies of the row to be created
but that behavior is subject to change.
An optional condition may be specified. If it is, then only
matched rows that satisfy the condition will be updated. Any
rows that do not satisfy the condition will be left as they
are. Failing to satisfy the condition does not cause a
"matched row" to become a "not matched" row.
The condition should be an SQL string. Use the prefix
target. to refer to rows in the target table (old data)
and the prefix source. to refer to rows in the source
table (new data).
For example, "target.last_update < source.last_update"
#### Parameters
* **options?**
* **options.where?**: `string`
#### Returns
[`MergeInsertBuilder`](MergeInsertBuilder.md)
***
### whenNotMatchedBySourceDelete()
```ts
whenNotMatchedBySourceDelete(options?): MergeInsertBuilder
```
Rows that exist only in the target table (old data) will be
deleted. An optional condition can be provided to limit what
data is deleted.
#### Parameters
* **options?**
* **options.where?**: `string`
An optional condition to limit what data is deleted
#### Returns
[`MergeInsertBuilder`](MergeInsertBuilder.md)
***
### whenNotMatchedInsertAll()
```ts
whenNotMatchedInsertAll(): MergeInsertBuilder
```
Rows that exist only in the source table (new data) should
be inserted into the target table.
#### Returns
[`MergeInsertBuilder`](MergeInsertBuilder.md)
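The three clauses documented above can be easier to reason about with a small in-memory model. The sketch below is not LanceDB's implementation: `mergeInsert`, `MergeOptions`, and the field names of `MergeCounts` are hypothetical, rows are matched on a single `id` key, and the optional `where` conditions are left out. It only illustrates which rows each clause touches and how per-operation counts could be tallied.

```ts
// Illustrative in-memory model of merge-insert semantics (not LanceDB code).
type Row = { id: number; value: string };

interface MergeOptions {
  whenMatchedUpdateAll?: boolean;
  whenNotMatchedInsertAll?: boolean;
  whenNotMatchedBySourceDelete?: boolean;
}

// Hypothetical stats shape; the real MergeStats fields may differ.
interface MergeCounts {
  inserted: number;
  updated: number;
  deleted: number;
}

function mergeInsert(
  target: Row[],
  source: Row[],
  opts: MergeOptions,
): { rows: Row[]; counts: MergeCounts } {
  const counts: MergeCounts = { inserted: 0, updated: 0, deleted: 0 };
  const sourceById = new Map<number, Row>();
  source.forEach((row) => sourceById.set(row.id, row));
  const targetIds = new Set<number>();
  target.forEach((row) => targetIds.add(row.id));

  const rows: Row[] = [];
  for (let i = 0; i < target.length; i++) {
    const old = target[i];
    const incoming = sourceById.get(old.id);
    if (incoming !== undefined) {
      // Row exists in both source and target ("matched").
      if (opts.whenMatchedUpdateAll) {
        rows.push(incoming);
        counts.updated++;
      } else {
        rows.push(old);
      }
    } else if (opts.whenNotMatchedBySourceDelete) {
      // Row exists only in the target: drop it.
      counts.deleted++;
    } else {
      rows.push(old);
    }
  }
  if (opts.whenNotMatchedInsertAll) {
    // Rows that exist only in the source are appended.
    for (let i = 0; i < source.length; i++) {
      if (!targetIds.has(source[i].id)) {
        rows.push(source[i]);
        counts.inserted++;
      }
    }
  }
  return { rows, counts };
}

const oldRows: Row[] = [{ id: 1, value: "old" }, { id: 2, value: "keep" }];
const newRows: Row[] = [{ id: 1, value: "new" }, { id: 3, value: "added" }];
const upsert = mergeInsert(oldRows, newRows, {
  whenMatchedUpdateAll: true,
  whenNotMatchedInsertAll: true,
});
console.log(upsert.counts); // { inserted: 1, updated: 1, deleted: 0 }
```

Combining `whenMatchedUpdateAll` with `whenNotMatchedInsertAll`, as in the usage above, gives the familiar "upsert" behavior; adding `whenNotMatchedBySourceDelete` would instead make the target mirror the source.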


@@ -0,0 +1,64 @@
[**@lancedb/lancedb**](../README.md) • **Docs**
***
[@lancedb/lancedb](../globals.md) / MultiMatchQuery
# Class: MultiMatchQuery
Represents a full-text query interface.
This interface defines the structure and behavior for full-text queries,
including methods to retrieve the query type and convert the query to a dictionary format.
## Implements
- [`FullTextQuery`](../interfaces/FullTextQuery.md)
## Constructors
### new MultiMatchQuery()
```ts
new MultiMatchQuery(
query,
columns,
options?): MultiMatchQuery
```
Creates an instance of MultiMatchQuery.
#### Parameters
* **query**: `string`
The text query to search for across multiple columns.
* **columns**: `string`[]
An array of column names to search within.
* **options?**
Optional parameters for the multi-match query.
- `boosts`: An array of boost factors for each column (default is 1.0 for all).
* **options.boosts?**: `number`[]
#### Returns
[`MultiMatchQuery`](MultiMatchQuery.md)
## Methods
### queryType()
```ts
queryType(): FullTextQueryType
```
The type of the full-text query.
#### Returns
[`FullTextQueryType`](../enumerations/FullTextQueryType.md)
#### Implementation of
[`FullTextQuery`](../interfaces/FullTextQuery.md).[`queryType`](../interfaces/FullTextQuery.md#querytype)


@@ -0,0 +1,55 @@
[**@lancedb/lancedb**](../README.md) • **Docs**
***
[@lancedb/lancedb](../globals.md) / PhraseQuery
# Class: PhraseQuery
Represents a full-text query interface.
This interface defines the structure and behavior for full-text queries,
including methods to retrieve the query type and convert the query to a dictionary format.
## Implements
- [`FullTextQuery`](../interfaces/FullTextQuery.md)
## Constructors
### new PhraseQuery()
```ts
new PhraseQuery(query, column): PhraseQuery
```
Creates an instance of `PhraseQuery`.
#### Parameters
* **query**: `string`
The phrase to search for in the specified column.
* **column**: `string`
The name of the column to search within.
#### Returns
[`PhraseQuery`](PhraseQuery.md)
## Methods
### queryType()
```ts
queryType(): FullTextQueryType
```
The type of the full-text query.
#### Returns
[`FullTextQueryType`](../enumerations/FullTextQueryType.md)
#### Implementation of
[`FullTextQuery`](../interfaces/FullTextQuery.md).[`queryType`](../interfaces/FullTextQuery.md#querytype)


@@ -8,30 +8,14 @@
A builder for LanceDB queries. A builder for LanceDB queries.
## See
[Table#query](Table.md#query), [Table#search](Table.md#search)
## Extends ## Extends
- [`QueryBase`](QueryBase.md)&lt;`NativeQuery`&gt; - [`QueryBase`](QueryBase.md)&lt;`NativeQuery`&gt;
## Constructors
### new Query()
```ts
new Query(tbl): Query
```
#### Parameters
* **tbl**: `Table`
#### Returns
[`Query`](Query.md)
#### Overrides
[`QueryBase`](QueryBase.md).[`constructor`](QueryBase.md#constructors)
## Properties ## Properties
### inner ### inner
@@ -46,39 +30,50 @@ protected inner: Query | Promise<Query>;
## Methods ## Methods
### \[asyncIterator\]() ### analyzePlan()
```ts ```ts
asyncIterator: AsyncIterator<RecordBatch<any>, any, undefined> analyzePlan(): Promise<string>
``` ```
Executes the query and returns the physical query plan annotated with runtime metrics.
This is useful for debugging and performance analysis, as it shows how the query was executed
and includes metrics such as elapsed time, rows processed, and I/O statistics.
#### Returns #### Returns
`AsyncIterator`&lt;`RecordBatch`&lt;`any`&gt;, `any`, `undefined`&gt; `Promise`&lt;`string`&gt;
#### Inherited from A query execution plan with runtime metrics for each step.
[`QueryBase`](QueryBase.md).[`[asyncIterator]`](QueryBase.md#%5Basynciterator%5D) #### Example
***
### doCall()
```ts ```ts
protected doCall(fn): void import * as lancedb from "@lancedb/lancedb"
const db = await lancedb.connect("./.lancedb");
const table = await db.createTable("my_table", [
{ vector: [1.1, 0.9], id: "1" },
]);
const plan = await table.query().nearestTo([0.5, 0.2]).analyzePlan();
Example output (with runtime metrics inlined):
AnalyzeExec verbose=true, metrics=[]
ProjectionExec: expr=[id@3 as id, vector@0 as vector, _distance@2 as _distance], metrics=[output_rows=1, elapsed_compute=3.292µs]
Take: columns="vector, _rowid, _distance, (id)", metrics=[output_rows=1, elapsed_compute=66.001µs, batches_processed=1, bytes_read=8, iops=1, requests=1]
CoalesceBatchesExec: target_batch_size=1024, metrics=[output_rows=1, elapsed_compute=3.333µs]
GlobalLimitExec: skip=0, fetch=10, metrics=[output_rows=1, elapsed_compute=167ns]
FilterExec: _distance@2 IS NOT NULL, metrics=[output_rows=1, elapsed_compute=8.542µs]
SortExec: TopK(fetch=10), expr=[_distance@2 ASC NULLS LAST], metrics=[output_rows=1, elapsed_compute=63.25µs, row_replacements=1]
KNNVectorDistance: metric=l2, metrics=[output_rows=1, elapsed_compute=114.333µs, output_batches=1]
LanceScan: uri=/path/to/data, projection=[vector], row_id=true, row_addr=false, ordered=false, metrics=[output_rows=1, elapsed_compute=103.626µs, bytes_read=549, iops=2, requests=2]
``` ```
#### Parameters
* **fn**
#### Returns
`void`
#### Inherited from #### Inherited from
[`QueryBase`](QueryBase.md).[`doCall`](QueryBase.md#docall) [`QueryBase`](QueryBase.md).[`analyzePlan`](QueryBase.md#analyzeplan)
*** ***
@@ -92,7 +87,7 @@ Execute the query and return the results as an
#### Parameters #### Parameters
* **options?**: `Partial`&lt;`QueryExecutionOptions`&gt; * **options?**: `Partial`&lt;[`QueryExecutionOptions`](../interfaces/QueryExecutionOptions.md)&gt;
#### Returns #### Returns
@@ -161,7 +156,7 @@ fastSearch(): this
Skip searching un-indexed data. This can make search faster, but will miss Skip searching un-indexed data. This can make search faster, but will miss
any data that is not yet indexed. any data that is not yet indexed.
Use lancedb.Table#optimize to index all un-indexed data. Use [Table#optimize](Table.md#optimize) to index all un-indexed data.
#### Returns #### Returns
@@ -189,7 +184,7 @@ A filter statement to be applied to this query.
`this` `this`
#### Alias #### See
where where
@@ -211,9 +206,9 @@ fullTextSearch(query, options?): this
#### Parameters #### Parameters
* **query**: `string` * **query**: `string` \| [`FullTextQuery`](../interfaces/FullTextQuery.md)
* **options?**: `Partial`&lt;`FullTextSearchOptions`&gt; * **options?**: `Partial`&lt;[`FullTextSearchOptions`](../interfaces/FullTextSearchOptions.md)&gt;
#### Returns #### Returns
@@ -250,26 +245,6 @@ called then every valid row from the table will be returned.
*** ***
### nativeExecute()
```ts
protected nativeExecute(options?): Promise<RecordBatchIterator>
```
#### Parameters
* **options?**: `Partial`&lt;`QueryExecutionOptions`&gt;
#### Returns
`Promise`&lt;`RecordBatchIterator`&gt;
#### Inherited from
[`QueryBase`](QueryBase.md).[`nativeExecute`](QueryBase.md#nativeexecute)
***
### nearestTo() ### nearestTo()
```ts ```ts
@@ -294,7 +269,7 @@ If there is more than one vector column you must use
#### Parameters #### Parameters
* **vector**: `IntoVector` * **vector**: [`IntoVector`](../type-aliases/IntoVector.md)
#### Returns #### Returns
@@ -334,7 +309,7 @@ nearestToText(query, columns?): Query
#### Parameters #### Parameters
* **query**: `string` * **query**: `string` \| [`FullTextQuery`](../interfaces/FullTextQuery.md)
* **columns?**: `string`[] * **columns?**: `string`[]
@@ -427,7 +402,7 @@ Collect the results as an array of objects.
#### Parameters #### Parameters
* **options?**: `Partial`&lt;`QueryExecutionOptions`&gt; * **options?**: `Partial`&lt;[`QueryExecutionOptions`](../interfaces/QueryExecutionOptions.md)&gt;
#### Returns #### Returns
@@ -449,7 +424,7 @@ Collect the results as an Arrow
#### Parameters #### Parameters
* **options?**: `Partial`&lt;`QueryExecutionOptions`&gt; * **options?**: `Partial`&lt;[`QueryExecutionOptions`](../interfaces/QueryExecutionOptions.md)&gt;
#### Returns #### Returns


@@ -8,6 +8,11 @@
Common methods supported by all query types Common methods supported by all query types
## See
- [Query](Query.md)
- [VectorQuery](VectorQuery.md)
## Extended by ## Extended by
- [`Query`](Query.md) - [`Query`](Query.md)
@@ -21,22 +26,6 @@ Common methods supported by all query types
- `AsyncIterable`&lt;`RecordBatch`&gt; - `AsyncIterable`&lt;`RecordBatch`&gt;
## Constructors
### new QueryBase()
```ts
protected new QueryBase<NativeQueryType>(inner): QueryBase<NativeQueryType>
```
#### Parameters
* **inner**: `NativeQueryType` \| `Promise`&lt;`NativeQueryType`&gt;
#### Returns
[`QueryBase`](QueryBase.md)&lt;`NativeQueryType`&gt;
## Properties ## Properties
### inner ### inner
@@ -47,36 +36,47 @@ protected inner: NativeQueryType | Promise<NativeQueryType>;
## Methods ## Methods
### \[asyncIterator\]() ### analyzePlan()
```ts ```ts
asyncIterator: AsyncIterator<RecordBatch<any>, any, undefined> analyzePlan(): Promise<string>
``` ```
Executes the query and returns the physical query plan annotated with runtime metrics.
This is useful for debugging and performance analysis, as it shows how the query was executed
and includes metrics such as elapsed time, rows processed, and I/O statistics.
#### Returns #### Returns
`AsyncIterator`&lt;`RecordBatch`&lt;`any`&gt;, `any`, `undefined`&gt; `Promise`&lt;`string`&gt;
#### Implementation of A query execution plan with runtime metrics for each step.
`AsyncIterable.[asyncIterator]` #### Example
***
### doCall()
```ts ```ts
protected doCall(fn): void import * as lancedb from "@lancedb/lancedb"
const db = await lancedb.connect("./.lancedb");
const table = await db.createTable("my_table", [
{ vector: [1.1, 0.9], id: "1" },
]);
const plan = await table.query().nearestTo([0.5, 0.2]).analyzePlan();
Example output (with runtime metrics inlined):
AnalyzeExec verbose=true, metrics=[]
ProjectionExec: expr=[id@3 as id, vector@0 as vector, _distance@2 as _distance], metrics=[output_rows=1, elapsed_compute=3.292µs]
Take: columns="vector, _rowid, _distance, (id)", metrics=[output_rows=1, elapsed_compute=66.001µs, batches_processed=1, bytes_read=8, iops=1, requests=1]
CoalesceBatchesExec: target_batch_size=1024, metrics=[output_rows=1, elapsed_compute=3.333µs]
GlobalLimitExec: skip=0, fetch=10, metrics=[output_rows=1, elapsed_compute=167ns]
FilterExec: _distance@2 IS NOT NULL, metrics=[output_rows=1, elapsed_compute=8.542µs]
SortExec: TopK(fetch=10), expr=[_distance@2 ASC NULLS LAST], metrics=[output_rows=1, elapsed_compute=63.25µs, row_replacements=1]
KNNVectorDistance: metric=l2, metrics=[output_rows=1, elapsed_compute=114.333µs, output_batches=1]
LanceScan: uri=/path/to/data, projection=[vector], row_id=true, row_addr=false, ordered=false, metrics=[output_rows=1, elapsed_compute=103.626µs, bytes_read=549, iops=2, requests=2]
``` ```
#### Parameters
* **fn**
#### Returns
`void`
*** ***
### execute() ### execute()
@@ -89,7 +89,7 @@ Execute the query and return the results as an
#### Parameters #### Parameters
* **options?**: `Partial`&lt;`QueryExecutionOptions`&gt; * **options?**: `Partial`&lt;[`QueryExecutionOptions`](../interfaces/QueryExecutionOptions.md)&gt;
#### Returns #### Returns
@@ -150,7 +150,7 @@ fastSearch(): this
Skip searching un-indexed data. This can make search faster, but will miss Skip searching un-indexed data. This can make search faster, but will miss
any data that is not yet indexed. any data that is not yet indexed.
Use lancedb.Table#optimize to index all un-indexed data. Use [Table#optimize](Table.md#optimize) to index all un-indexed data.
#### Returns #### Returns
@@ -174,7 +174,7 @@ A filter statement to be applied to this query.
`this` `this`
#### Alias #### See
where where
@@ -192,9 +192,9 @@ fullTextSearch(query, options?): this
#### Parameters #### Parameters
* **query**: `string` * **query**: `string` \| [`FullTextQuery`](../interfaces/FullTextQuery.md)
* **options?**: `Partial`&lt;`FullTextSearchOptions`&gt; * **options?**: `Partial`&lt;[`FullTextSearchOptions`](../interfaces/FullTextSearchOptions.md)&gt;
#### Returns #### Returns
@@ -223,22 +223,6 @@ called then every valid row from the table will be returned.
*** ***
### nativeExecute()
```ts
protected nativeExecute(options?): Promise<RecordBatchIterator>
```
#### Parameters
* **options?**: `Partial`&lt;`QueryExecutionOptions`&gt;
#### Returns
`Promise`&lt;`RecordBatchIterator`&gt;
***
### offset() ### offset()
```ts ```ts
@@ -314,7 +298,7 @@ Collect the results as an array of objects.
#### Parameters #### Parameters
* **options?**: `Partial`&lt;`QueryExecutionOptions`&gt; * **options?**: `Partial`&lt;[`QueryExecutionOptions`](../interfaces/QueryExecutionOptions.md)&gt;
#### Returns #### Returns
@@ -332,7 +316,7 @@ Collect the results as an Arrow
#### Parameters #### Parameters
* **options?**: `Partial`&lt;`QueryExecutionOptions`&gt; * **options?**: `Partial`&lt;[`QueryExecutionOptions`](../interfaces/QueryExecutionOptions.md)&gt;
#### Returns #### Returns


@@ -14,21 +14,13 @@ will be freed when the Table is garbage collected. To eagerly free the cache yo
can call the `close` method. Once the Table is closed, it cannot be used for any can call the `close` method. Once the Table is closed, it cannot be used for any
further operations. further operations.
Tables are created using the methods [Connection#createTable](Connection.md#createtable)
and [Connection#createEmptyTable](Connection.md#createemptytable). Existing tables are opened
using [Connection#openTable](Connection.md#opentable).
Closing a table is optional. If not closed, it will be closed when it is garbage Closing a table is optional. If not closed, it will be closed when it is garbage
collected. collected.
## Constructors
### new Table()
```ts
new Table(): Table
```
#### Returns
[`Table`](Table.md)
## Accessors ## Accessors
### name ### name
@@ -125,8 +117,8 @@ wish to return to standard mode, call `checkoutLatest`.
#### Parameters #### Parameters
* **version**: `number` * **version**: `string` \| `number`
The version to checkout The version to check out; this can be a version number or a tag
#### Returns #### Returns
@@ -216,6 +208,9 @@ Indices on vector columns will speed up vector searches.
Indices on scalar columns will speed up filtering (in both Indices on scalar columns will speed up filtering (in both
vector and non-vector searches) vector and non-vector searches)
We currently don't support custom named indexes.
The index name will always be `${column}_idx`.
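Because the name is derived from the column, it can be computed whenever a method asks for an index name. A minimal sketch; `defaultIndexName` is a hypothetical helper and not part of the LanceDB API:

```ts
// Naming rule stated above: the index on `column` is always called `${column}_idx`.
// `defaultIndexName` is a hypothetical helper, not a LanceDB function.
function defaultIndexName(column: string): string {
  return `${column}_idx`;
}

const indexName = defaultIndexName("vector");
console.log(indexName); // "vector_idx"
```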
#### Parameters #### Parameters
* **column**: `string` * **column**: `string`
@@ -226,11 +221,6 @@ vector and non-vector searches)
`Promise`&lt;`void`&gt; `Promise`&lt;`void`&gt;
#### Note
We currently don't support custom named indexes,
The index name will always be `${column}_idx`
#### Examples #### Examples
```ts ```ts
@@ -317,6 +307,28 @@ then call ``cleanup_files`` to remove the old files.
*** ***
### dropIndex()
```ts
abstract dropIndex(name): Promise<void>
```
Drop an index from the table.
#### Parameters
* **name**: `string`
The name of the index.
This does not delete the index from disk, it just removes it from the table.
To delete the index, run [Table#optimize](Table.md#optimize) after dropping the index.
Use [Table.listIndices](Table.md#listindices) to find the names of the indices.
#### Returns
`Promise`&lt;`void`&gt;
***
### indexStats() ### indexStats()
```ts ```ts
@@ -336,6 +348,8 @@ List all the stats of a specified index
The stats of the index. If the index does not exist, it will return undefined The stats of the index. If the index does not exist, it will return undefined
Use [Table.listIndices](Table.md#listindices) to find the names of the indices.
*** ***
### isOpen() ### isOpen()
@@ -376,7 +390,7 @@ List all the versions of the table
#### Returns #### Returns
`Promise`&lt;`Version`[]&gt; `Promise`&lt;[`Version`](../interfaces/Version.md)[]&gt;
*** ***
@@ -392,7 +406,7 @@ abstract mergeInsert(on): MergeInsertBuilder
#### Returns #### Returns
`MergeInsertBuilder` [`MergeInsertBuilder`](MergeInsertBuilder.md)
*** ***
@@ -436,7 +450,29 @@ Modeled after ``VACUUM`` in PostgreSQL.
#### Returns #### Returns
`Promise`&lt;`OptimizeStats`&gt; `Promise`&lt;[`OptimizeStats`](../interfaces/OptimizeStats.md)&gt;
***
### prewarmIndex()
```ts
abstract prewarmIndex(name): Promise<void>
```
Prewarm an index in the table.
#### Parameters
* **name**: `string`
The name of the index.
This will load the index into memory. This may reduce the cold-start time for
future queries. If the index does not fit in the cache then this call may be
wasteful.
#### Returns
`Promise`&lt;`void`&gt;
*** ***
@@ -553,7 +589,7 @@ Get the schema of the table.
abstract search( abstract search(
query, query,
queryType?, queryType?,
ftsColumns?): VectorQuery | Query ftsColumns?): Query | VectorQuery
``` ```
Create a search query to find the nearest neighbors Create a search query to find the nearest neighbors
@@ -561,7 +597,7 @@ of the given query
#### Parameters #### Parameters
* **query**: `string` \| `IntoVector` * **query**: `string` \| [`IntoVector`](../type-aliases/IntoVector.md) \| [`FullTextQuery`](../interfaces/FullTextQuery.md)
the query, a vector or string the query, a vector or string
* **queryType?**: `string` * **queryType?**: `string`
@@ -575,7 +611,51 @@ of the given query
#### Returns #### Returns
[`VectorQuery`](VectorQuery.md) \| [`Query`](Query.md) [`Query`](Query.md) \| [`VectorQuery`](VectorQuery.md)
***
### stats()
```ts
abstract stats(): Promise<TableStatistics>
```
Returns table and fragment statistics
#### Returns
`Promise`&lt;[`TableStatistics`](../interfaces/TableStatistics.md)&gt;
The table and fragment statistics
***
### tags()
```ts
abstract tags(): Promise<Tags>
```
Get a tags manager for this table.
Tags allow you to label specific versions of a table with a human-readable name.
The returned tags manager can be used to list, create, update, or delete tags.
#### Returns
`Promise`&lt;[`Tags`](Tags.md)&gt;
A tags manager for this table
#### Example
```typescript
const tagsManager = await table.tags();
await tagsManager.create("v1", 1);
const tags = await tagsManager.list();
console.log(tags); // { "v1": { version: 1, manifestSize: ... } }
```
*** ***
@@ -694,7 +774,7 @@ by `query`.
#### Parameters #### Parameters
* **vector**: `IntoVector` * **vector**: [`IntoVector`](../type-aliases/IntoVector.md)
#### Returns #### Returns
@@ -720,35 +800,23 @@ Retrieve the version of the table
*** ***
### parseTableData() ### waitForIndex()
```ts ```ts
static parseTableData( abstract waitForIndex(indexNames, timeoutSeconds): Promise<void>
data,
options?,
streaming?): Promise<object>
``` ```
Waits for asynchronous indexing to complete on the table.
#### Parameters #### Parameters
* **data**: `TableLike` \| `Record`&lt;`string`, `unknown`&gt;[] * **indexNames**: `string`[]
The names of the indices to wait for
* **options?**: `Partial`&lt;[`CreateTableOptions`](../interfaces/CreateTableOptions.md)&gt; * **timeoutSeconds**: `number`
The number of seconds to wait before timing out
* **streaming?**: `boolean` = `false` This will raise an error if the indices are not created and fully indexed within the timeout.
#### Returns #### Returns
`Promise`&lt;`object`&gt; `Promise`&lt;`void`&gt;
##### buf
```ts
buf: Buffer;
```
##### mode
```ts
mode: string;
```

@@ -0,0 +1,35 @@
[**@lancedb/lancedb**](../README.md) • **Docs**
***
[@lancedb/lancedb](../globals.md) / TagContents
# Class: TagContents
## Constructors
### new TagContents()
```ts
new TagContents(): TagContents
```
#### Returns
[`TagContents`](TagContents.md)
## Properties
### manifestSize
```ts
manifestSize: number;
```
***
### version
```ts
version: number;
```

@@ -0,0 +1,99 @@
[**@lancedb/lancedb**](../README.md) • **Docs**
***
[@lancedb/lancedb](../globals.md) / Tags
# Class: Tags
## Constructors
### new Tags()
```ts
new Tags(): Tags
```
#### Returns
[`Tags`](Tags.md)
## Methods
### create()
```ts
create(tag, version): Promise<void>
```
#### Parameters
* **tag**: `string`
* **version**: `number`
#### Returns
`Promise`&lt;`void`&gt;
***
### delete()
```ts
delete(tag): Promise<void>
```
#### Parameters
* **tag**: `string`
#### Returns
`Promise`&lt;`void`&gt;
***
### getVersion()
```ts
getVersion(tag): Promise<number>
```
#### Parameters
* **tag**: `string`
#### Returns
`Promise`&lt;`number`&gt;
***
### list()
```ts
list(): Promise<Record<string, TagContents>>
```
#### Returns
`Promise`&lt;`Record`&lt;`string`, [`TagContents`](TagContents.md)&gt;&gt;
***
### update()
```ts
update(tag, version): Promise<void>
```
#### Parameters
* **tag**: `string`
* **version**: `number`
#### Returns
`Promise`&lt;`void`&gt;

@@ -10,30 +10,14 @@ A builder used to construct a vector search
This builder can be reused to execute the query many times. This builder can be reused to execute the query many times.
## See
[Query#nearestTo](Query.md#nearestto)
## Extends ## Extends
- [`QueryBase`](QueryBase.md)&lt;`NativeVectorQuery`&gt; - [`QueryBase`](QueryBase.md)&lt;`NativeVectorQuery`&gt;
## Constructors
### new VectorQuery()
```ts
new VectorQuery(inner): VectorQuery
```
#### Parameters
* **inner**: `VectorQuery` \| `Promise`&lt;`VectorQuery`&gt;
#### Returns
[`VectorQuery`](VectorQuery.md)
#### Overrides
[`QueryBase`](QueryBase.md).[`constructor`](QueryBase.md#constructors)
## Properties ## Properties
### inner ### inner
@@ -48,22 +32,6 @@ protected inner: VectorQuery | Promise<VectorQuery>;
## Methods ## Methods
### \[asyncIterator\]()
```ts
asyncIterator: AsyncIterator<RecordBatch<any>, any, undefined>
```
#### Returns
`AsyncIterator`&lt;`RecordBatch`&lt;`any`&gt;, `any`, `undefined`&gt;
#### Inherited from
[`QueryBase`](QueryBase.md).[`[asyncIterator]`](QueryBase.md#%5Basynciterator%5D)
***
### addQueryVector() ### addQueryVector()
```ts ```ts
@@ -72,7 +40,7 @@ addQueryVector(vector): VectorQuery
#### Parameters #### Parameters
* **vector**: `IntoVector` * **vector**: [`IntoVector`](../type-aliases/IntoVector.md)
#### Returns #### Returns
@@ -80,6 +48,53 @@ addQueryVector(vector): VectorQuery
*** ***
### analyzePlan()
```ts
analyzePlan(): Promise<string>
```
Executes the query and returns the physical query plan annotated with runtime metrics.
This is useful for debugging and performance analysis, as it shows how the query was executed
and includes metrics such as elapsed time, rows processed, and I/O statistics.
#### Returns
`Promise`&lt;`string`&gt;
A query execution plan with runtime metrics for each step.
#### Example
```ts
import * as lancedb from "@lancedb/lancedb"
const db = await lancedb.connect("./.lancedb");
const table = await db.createTable("my_table", [
{ vector: [1.1, 0.9], id: "1" },
]);
const plan = await table.query().nearestTo([0.5, 0.2]).analyzePlan();
```
Example output (with runtime metrics inlined):
```
AnalyzeExec verbose=true, metrics=[]
ProjectionExec: expr=[id@3 as id, vector@0 as vector, _distance@2 as _distance], metrics=[output_rows=1, elapsed_compute=3.292µs]
Take: columns="vector, _rowid, _distance, (id)", metrics=[output_rows=1, elapsed_compute=66.001µs, batches_processed=1, bytes_read=8, iops=1, requests=1]
CoalesceBatchesExec: target_batch_size=1024, metrics=[output_rows=1, elapsed_compute=3.333µs]
GlobalLimitExec: skip=0, fetch=10, metrics=[output_rows=1, elapsed_compute=167ns]
FilterExec: _distance@2 IS NOT NULL, metrics=[output_rows=1, elapsed_compute=8.542µs]
SortExec: TopK(fetch=10), expr=[_distance@2 ASC NULLS LAST], metrics=[output_rows=1, elapsed_compute=63.25µs, row_replacements=1]
KNNVectorDistance: metric=l2, metrics=[output_rows=1, elapsed_compute=114.333µs, output_batches=1]
LanceScan: uri=/path/to/data, projection=[vector], row_id=true, row_addr=false, ordered=false, metrics=[output_rows=1, elapsed_compute=103.626µs, bytes_read=549, iops=2, requests=2]
```
#### Inherited from
[`QueryBase`](QueryBase.md).[`analyzePlan`](QueryBase.md#analyzeplan)
***
### bypassVectorIndex() ### bypassVectorIndex()
```ts ```ts
@@ -128,6 +143,24 @@ whose data type is a fixed-size-list of floats.
*** ***
### distanceRange()
```ts
distanceRange(lowerBound?, upperBound?): VectorQuery
```
#### Parameters
* **lowerBound?**: `number`
* **upperBound?**: `number`
#### Returns
[`VectorQuery`](VectorQuery.md)
***
### distanceType() ### distanceType()
```ts ```ts
@@ -161,26 +194,6 @@ By default "l2" is used.
*** ***
### doCall()
```ts
protected doCall(fn): void
```
#### Parameters
* **fn**
#### Returns
`void`
#### Inherited from
[`QueryBase`](QueryBase.md).[`doCall`](QueryBase.md#docall)
***
### ef() ### ef()
```ts ```ts
@@ -215,7 +228,7 @@ Execute the query and return the results as an
#### Parameters #### Parameters
* **options?**: `Partial`&lt;`QueryExecutionOptions`&gt; * **options?**: `Partial`&lt;[`QueryExecutionOptions`](../interfaces/QueryExecutionOptions.md)&gt;
#### Returns #### Returns
@@ -284,7 +297,7 @@ fastSearch(): this
Skip searching un-indexed data. This can make search faster, but will miss Skip searching un-indexed data. This can make search faster, but will miss
any data that is not yet indexed. any data that is not yet indexed.
Use lancedb.Table#optimize to index all un-indexed data. Use [Table#optimize](Table.md#optimize) to index all un-indexed data.
#### Returns #### Returns
@@ -312,7 +325,7 @@ A filter statement to be applied to this query.
`this` `this`
#### Alias #### See
where where
@@ -334,9 +347,9 @@ fullTextSearch(query, options?): this
#### Parameters #### Parameters
* **query**: `string` * **query**: `string` \| [`FullTextQuery`](../interfaces/FullTextQuery.md)
* **options?**: `Partial`&lt;`FullTextSearchOptions`&gt; * **options?**: `Partial`&lt;[`FullTextSearchOptions`](../interfaces/FullTextSearchOptions.md)&gt;
#### Returns #### Returns
@@ -373,26 +386,6 @@ called then every valid row from the table will be returned.
*** ***
### nativeExecute()
```ts
protected nativeExecute(options?): Promise<RecordBatchIterator>
```
#### Parameters
* **options?**: `Partial`&lt;`QueryExecutionOptions`&gt;
#### Returns
`Promise`&lt;`RecordBatchIterator`&gt;
#### Inherited from
[`QueryBase`](QueryBase.md).[`nativeExecute`](QueryBase.md#nativeexecute)
***
### nprobes() ### nprobes()
```ts ```ts
@@ -528,6 +521,22 @@ distance between the query vector and the actual uncompressed vector.
*** ***
### rerank()
```ts
rerank(reranker): VectorQuery
```
#### Parameters
* **reranker**: [`Reranker`](../namespaces/rerankers/interfaces/Reranker.md)
#### Returns
[`VectorQuery`](VectorQuery.md)
***
### select() ### select()
```ts ```ts
@@ -591,7 +600,7 @@ Collect the results as an array of objects.
#### Parameters #### Parameters
* **options?**: `Partial`&lt;`QueryExecutionOptions`&gt; * **options?**: `Partial`&lt;[`QueryExecutionOptions`](../interfaces/QueryExecutionOptions.md)&gt;
#### Returns #### Returns
@@ -613,7 +622,7 @@ Collect the results as an Arrow
#### Parameters #### Parameters
* **options?**: `Partial`&lt;`QueryExecutionOptions`&gt; * **options?**: `Partial`&lt;[`QueryExecutionOptions`](../interfaces/QueryExecutionOptions.md)&gt;
#### Returns #### Returns

@@ -0,0 +1,46 @@
[**@lancedb/lancedb**](../README.md) • **Docs**
***
[@lancedb/lancedb](../globals.md) / FullTextQueryType
# Enumeration: FullTextQueryType
Enum representing the types of full-text queries supported.
- `Match`: Performs a full-text search for terms in the query string.
- `MatchPhrase`: Searches for an exact phrase match in the text.
- `Boost`: Boosts the relevance score of specific terms in the query.
- `MultiMatch`: Searches across multiple fields for the query terms.
## Enumeration Members
### Boost
```ts
Boost: "boost";
```
***
### Match
```ts
Match: "match";
```
***
### MatchPhrase
```ts
MatchPhrase: "match_phrase";
```
***
### MultiMatch
```ts
MultiMatch: "multi_match";
```
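As a rough illustration of how the four members listed above might be consumed, the sketch below declares a local copy of the enum (so it runs without the package installed) and maps each member to the behavior described in the enumeration summary. The helper `describeQueryType` is hypothetical and not part of `@lancedb/lancedb`.

```ts
// Local mirror of the documented enum values; in real code you would
// import FullTextQueryType from "@lancedb/lancedb" instead.
enum FullTextQueryType {
  Match = "match",
  MatchPhrase = "match_phrase",
  Boost = "boost",
  MultiMatch = "multi_match",
}

// Hypothetical helper: returns a short description for each query type.
function describeQueryType(t: FullTextQueryType): string {
  switch (t) {
    case FullTextQueryType.Match:
      return "full-text search for terms in the query string";
    case FullTextQueryType.MatchPhrase:
      return "exact phrase match";
    case FullTextQueryType.Boost:
      return "boosts the relevance score of specific terms";
    case FullTextQueryType.MultiMatch:
      return "searches across multiple fields";
  }
}

console.log(FullTextQueryType.MatchPhrase, "->", describeQueryType(FullTextQueryType.MatchPhrase));
```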

@@ -1,33 +0,0 @@
[**@lancedb/lancedb**](../README.md) • **Docs**
***
[@lancedb/lancedb](../globals.md) / WriteMode
# Enumeration: WriteMode
Write mode for writing a table.
## Enumeration Members
### Append
```ts
Append: "Append";
```
***
### Create
```ts
Create: "Create";
```
***
### Overwrite
```ts
Overwrite: "Overwrite";
```

@@ -6,10 +6,10 @@
# Function: connect() # Function: connect()
## connect(uri, opts) ## connect(uri, options)
```ts ```ts
function connect(uri, opts?): Promise<Connection> function connect(uri, options?): Promise<Connection>
``` ```
Connect to a LanceDB instance at the given URI. Connect to a LanceDB instance at the given URI.
@@ -26,7 +26,8 @@ Accepted formats:
The uri of the database. If the database uri starts The uri of the database. If the database uri starts
with `db://` then it connects to a remote database. with `db://` then it connects to a remote database.
* **opts?**: `Partial`&lt;[`ConnectionOptions`](../interfaces/ConnectionOptions.md)&gt; * **options?**: `Partial`&lt;[`ConnectionOptions`](../interfaces/ConnectionOptions.md)&gt;
The options to use when connecting to the database
### Returns ### Returns
@@ -49,10 +50,10 @@ const conn = await connect(
}); });
``` ```
## connect(opts) ## connect(options)
```ts ```ts
function connect(opts): Promise<Connection> function connect(options): Promise<Connection>
``` ```
Connect to a LanceDB instance at the given URI. Connect to a LanceDB instance at the given URI.
@@ -65,7 +66,8 @@ Accepted formats:
### Parameters ### Parameters
* **opts**: `Partial`&lt;[`ConnectionOptions`](../interfaces/ConnectionOptions.md)&gt; & `object` * **options**: `Partial`&lt;[`ConnectionOptions`](../interfaces/ConnectionOptions.md)&gt; & `object`
The options to use when connecting to the database
### Returns ### Returns

@@ -22,8 +22,6 @@ when creating a table or adding data to it)
This function converts an array of Record<String, any> (row-major JS objects) This function converts an array of Record<String, any> (row-major JS objects)
to an Arrow Table (a columnar structure) to an Arrow Table (a columnar structure)
Note that it currently does not support nulls.
If a schema is provided then it will be used to determine the resulting array If a schema is provided then it will be used to determine the resulting array
types. Fields will also be reordered to fit the order defined by the schema. types. Fields will also be reordered to fit the order defined by the schema.
@@ -31,6 +29,9 @@ If a schema is not provided then the types will be inferred and the field order
will be controlled by the order of properties in the first record. If a type will be controlled by the order of properties in the first record. If a type
is inferred it will always be nullable. is inferred it will always be nullable.
If not all fields are found in the data, then a subset of the schema will be
returned.
If the input is empty then a schema must be provided to create an empty table. If the input is empty then a schema must be provided to create an empty table.
When a schema is not specified then data types will be inferred. The inference When a schema is not specified then data types will be inferred. The inference
@@ -38,6 +39,7 @@ rules are as follows:
- boolean => Bool - boolean => Bool
- number => Float64 - number => Float64
- bigint => Int64
- String => Utf8 - String => Utf8
- Buffer => Binary - Buffer => Binary
- Record<String, any> => Struct - Record<String, any> => Struct
@@ -57,6 +59,7 @@ rules are as follows:
## Example ## Example
```ts
import { fromTableToBuffer, makeArrowTable } from "../arrow"; import { fromTableToBuffer, makeArrowTable } from "../arrow";
import { Field, FixedSizeList, Float16, Float32, Int32, Schema } from "apache-arrow"; import { Field, FixedSizeList, Float16, Float32, Int32, Schema } from "apache-arrow";
@@ -78,7 +81,6 @@ The `vectorColumns` option can be used to support other vector column
names and data types. names and data types.
```ts ```ts
const schema = new Schema([ const schema = new Schema([
new Field("a", new Float64()), new Field("a", new Float64()),
new Field("b", new Float64()), new Field("b", new Float64()),
@@ -97,8 +99,7 @@ const schema = new Schema([
You can specify the vector column types and names using the options as well You can specify the vector column types and names using the options as well
```typescript ```ts
const schema = new Schema([ const schema = new Schema([
new Field('a', new Float64()), new Field('a', new Float64()),
new Field('b', new Float64()), new Field('b', new Float64()),

@@ -0,0 +1,19 @@
[**@lancedb/lancedb**](../README.md) • **Docs**
***
[@lancedb/lancedb](../globals.md) / packBits
# Function: packBits()
```ts
function packBits(data): number[]
```
## Parameters
* **data**: `number`[]
## Returns
`number`[]

@@ -7,20 +7,28 @@
## Namespaces ## Namespaces
- [embedding](namespaces/embedding/README.md) - [embedding](namespaces/embedding/README.md)
- [rerankers](namespaces/rerankers/README.md)
## Enumerations ## Enumerations
- [WriteMode](enumerations/WriteMode.md) - [FullTextQueryType](enumerations/FullTextQueryType.md)
## Classes ## Classes
- [BoostQuery](classes/BoostQuery.md)
- [Connection](classes/Connection.md) - [Connection](classes/Connection.md)
- [Index](classes/Index.md) - [Index](classes/Index.md)
- [MakeArrowTableOptions](classes/MakeArrowTableOptions.md) - [MakeArrowTableOptions](classes/MakeArrowTableOptions.md)
- [MatchQuery](classes/MatchQuery.md)
- [MergeInsertBuilder](classes/MergeInsertBuilder.md)
- [MultiMatchQuery](classes/MultiMatchQuery.md)
- [PhraseQuery](classes/PhraseQuery.md)
- [Query](classes/Query.md) - [Query](classes/Query.md)
- [QueryBase](classes/QueryBase.md) - [QueryBase](classes/QueryBase.md)
- [RecordBatchIterator](classes/RecordBatchIterator.md) - [RecordBatchIterator](classes/RecordBatchIterator.md)
- [Table](classes/Table.md) - [Table](classes/Table.md)
- [TagContents](classes/TagContents.md)
- [Tags](classes/Tags.md)
- [VectorColumnOptions](classes/VectorColumnOptions.md) - [VectorColumnOptions](classes/VectorColumnOptions.md)
- [VectorQuery](classes/VectorQuery.md) - [VectorQuery](classes/VectorQuery.md)
@@ -30,25 +38,48 @@
- [AddDataOptions](interfaces/AddDataOptions.md) - [AddDataOptions](interfaces/AddDataOptions.md)
- [ClientConfig](interfaces/ClientConfig.md) - [ClientConfig](interfaces/ClientConfig.md)
- [ColumnAlteration](interfaces/ColumnAlteration.md) - [ColumnAlteration](interfaces/ColumnAlteration.md)
- [CompactionStats](interfaces/CompactionStats.md)
- [ConnectionOptions](interfaces/ConnectionOptions.md) - [ConnectionOptions](interfaces/ConnectionOptions.md)
- [CreateTableOptions](interfaces/CreateTableOptions.md) - [CreateTableOptions](interfaces/CreateTableOptions.md)
- [ExecutableQuery](interfaces/ExecutableQuery.md) - [ExecutableQuery](interfaces/ExecutableQuery.md)
- [FragmentStatistics](interfaces/FragmentStatistics.md)
- [FragmentSummaryStats](interfaces/FragmentSummaryStats.md)
- [FtsOptions](interfaces/FtsOptions.md)
- [FullTextQuery](interfaces/FullTextQuery.md)
- [FullTextSearchOptions](interfaces/FullTextSearchOptions.md)
- [HnswPqOptions](interfaces/HnswPqOptions.md)
- [HnswSqOptions](interfaces/HnswSqOptions.md)
- [IndexConfig](interfaces/IndexConfig.md) - [IndexConfig](interfaces/IndexConfig.md)
- [IndexOptions](interfaces/IndexOptions.md) - [IndexOptions](interfaces/IndexOptions.md)
- [IndexStatistics](interfaces/IndexStatistics.md) - [IndexStatistics](interfaces/IndexStatistics.md)
- [IvfFlatOptions](interfaces/IvfFlatOptions.md)
- [IvfPqOptions](interfaces/IvfPqOptions.md) - [IvfPqOptions](interfaces/IvfPqOptions.md)
- [MergeStats](interfaces/MergeStats.md)
- [OpenTableOptions](interfaces/OpenTableOptions.md)
- [OptimizeOptions](interfaces/OptimizeOptions.md) - [OptimizeOptions](interfaces/OptimizeOptions.md)
- [OptimizeStats](interfaces/OptimizeStats.md)
- [QueryExecutionOptions](interfaces/QueryExecutionOptions.md)
- [RemovalStats](interfaces/RemovalStats.md)
- [RetryConfig](interfaces/RetryConfig.md) - [RetryConfig](interfaces/RetryConfig.md)
- [TableNamesOptions](interfaces/TableNamesOptions.md) - [TableNamesOptions](interfaces/TableNamesOptions.md)
- [TableStatistics](interfaces/TableStatistics.md)
- [TimeoutConfig](interfaces/TimeoutConfig.md) - [TimeoutConfig](interfaces/TimeoutConfig.md)
- [UpdateOptions](interfaces/UpdateOptions.md) - [UpdateOptions](interfaces/UpdateOptions.md)
- [WriteOptions](interfaces/WriteOptions.md) - [Version](interfaces/Version.md)
## Type Aliases ## Type Aliases
- [Data](type-aliases/Data.md) - [Data](type-aliases/Data.md)
- [DataLike](type-aliases/DataLike.md)
- [FieldLike](type-aliases/FieldLike.md)
- [IntoSql](type-aliases/IntoSql.md)
- [IntoVector](type-aliases/IntoVector.md)
- [RecordBatchLike](type-aliases/RecordBatchLike.md)
- [SchemaLike](type-aliases/SchemaLike.md)
- [TableLike](type-aliases/TableLike.md)
## Functions ## Functions
- [connect](functions/connect.md) - [connect](functions/connect.md)
- [makeArrowTable](functions/makeArrowTable.md) - [makeArrowTable](functions/makeArrowTable.md)
- [packBits](functions/packBits.md)

@@ -8,6 +8,14 @@
## Properties ## Properties
### extraHeaders?
```ts
optional extraHeaders: Record<string, string>;
```
***
### retryConfig? ### retryConfig?
```ts ```ts

@@ -16,7 +16,7 @@ must be provided.
### dataType? ### dataType?
```ts ```ts
optional dataType: string; optional dataType: string | DataType<Type, any>;
``` ```
A new data type for the column. If not provided then the data type will not be changed. A new data type for the column. If not provided then the data type will not be changed.

@@ -0,0 +1,49 @@
[**@lancedb/lancedb**](../README.md) • **Docs**
***
[@lancedb/lancedb](../globals.md) / CompactionStats
# Interface: CompactionStats
Statistics about a compaction operation.
## Properties
### filesAdded
```ts
filesAdded: number;
```
The number of new, compacted data files added
***
### filesRemoved
```ts
filesRemoved: number;
```
The number of data files removed
***
### fragmentsAdded
```ts
fragmentsAdded: number;
```
The number of new, compacted fragments added
***
### fragmentsRemoved
```ts
fragmentsRemoved: number;
```
The number of fragments removed

@@ -8,7 +8,7 @@
## Properties ## Properties
### dataStorageVersion? ### ~~dataStorageVersion?~~
```ts ```ts
optional dataStorageVersion: string; optional dataStorageVersion: string;
@@ -19,6 +19,10 @@ The version of the data storage format to use.
The default is `stable`. The default is `stable`.
Set to "legacy" to use the old format. Set to "legacy" to use the old format.
#### Deprecated
Pass `new_table_data_storage_version` to storageOptions instead.
*** ***
### embeddingFunction? ### embeddingFunction?
@@ -29,7 +33,7 @@ optional embeddingFunction: EmbeddingFunctionConfig;
*** ***
### enableV2ManifestPaths? ### ~~enableV2ManifestPaths?~~
```ts ```ts
optional enableV2ManifestPaths: boolean; optional enableV2ManifestPaths: boolean;
@@ -41,6 +45,10 @@ turning this on will make the dataset unreadable for older versions
of LanceDB (prior to 0.10.0). To migrate an existing dataset, instead of LanceDB (prior to 0.10.0). To migrate an existing dataset, instead
use the LocalTable#migrateManifestPathsV2 method. use the LocalTable#migrateManifestPathsV2 method.
#### Deprecated
Pass `new_table_enable_v2_manifest_paths` to storageOptions instead.
*** ***
### existOk ### existOk
@@ -90,17 +98,3 @@ Options already set on the connection will be inherited by the table,
but can be overridden here. but can be overridden here.
The available options are described at https://lancedb.github.io/lancedb/guides/storage/ The available options are described at https://lancedb.github.io/lancedb/guides/storage/
***
### useLegacyFormat?
```ts
optional useLegacyFormat: boolean;
```
If true then data files will be written with the legacy format
The default is false.
Deprecated. Use data storage version instead.

@@ -0,0 +1,37 @@
[**@lancedb/lancedb**](../README.md) • **Docs**
***
[@lancedb/lancedb](../globals.md) / FragmentStatistics
# Interface: FragmentStatistics
## Properties
### lengths
```ts
lengths: FragmentSummaryStats;
```
Statistics on the number of rows in the table fragments
***
### numFragments
```ts
numFragments: number;
```
The number of fragments in the table
***
### numSmallFragments
```ts
numSmallFragments: number;
```
The number of uncompacted fragments in the table

@@ -0,0 +1,77 @@
[**@lancedb/lancedb**](../README.md) • **Docs**
***
[@lancedb/lancedb](../globals.md) / FragmentSummaryStats
# Interface: FragmentSummaryStats
## Properties
### max
```ts
max: number;
```
The number of rows in the fragment with the most rows
***
### mean
```ts
mean: number;
```
The mean number of rows in the fragments
***
### min
```ts
min: number;
```
The number of rows in the fragment with the fewest rows
***
### p25
```ts
p25: number;
```
The 25th percentile of number of rows in the fragments
***
### p50
```ts
p50: number;
```
The 50th percentile of number of rows in the fragments
***
### p75
```ts
p75: number;
```
The 75th percentile of number of rows in the fragments
***
### p99
```ts
p99: number;
```
The 99th percentile of number of rows in the fragments
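To make the fields above concrete, here is a minimal sketch that computes the same seven summary values from a list of per-fragment row counts. The interface name `SummarySketch`, the function `summarize`, and the nearest-rank percentile method are all assumptions made for illustration; the source does not say which percentile method LanceDB uses.

```ts
// Hypothetical shape mirroring the documented FragmentSummaryStats fields.
interface SummarySketch {
  min: number;
  max: number;
  mean: number;
  p25: number;
  p50: number;
  p75: number;
  p99: number;
}

// Summarize per-fragment row counts.
// Assumption: percentiles use the nearest-rank method.
function summarize(rowCounts: number[]): SummarySketch {
  if (rowCounts.length === 0) {
    throw new Error("need at least one fragment");
  }
  const sorted = rowCounts.slice().sort((a, b) => a - b);
  const pct = (p: number): number =>
    sorted[Math.max(0, Math.ceil((p / 100) * sorted.length) - 1)];
  const mean = sorted.reduce((sum, x) => sum + x, 0) / sorted.length;
  return {
    min: sorted[0],
    max: sorted[sorted.length - 1],
    mean,
    p25: pct(25),
    p50: pct(50),
    p75: pct(75),
    p99: pct(99),
  };
}

console.log(summarize([10, 20, 30, 40]));
```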

@@ -0,0 +1,103 @@
[**@lancedb/lancedb**](../README.md) • **Docs**
***
[@lancedb/lancedb](../globals.md) / FtsOptions
# Interface: FtsOptions
Options to create a full text search index
## Properties
### asciiFolding?
```ts
optional asciiFolding: boolean;
```
whether to fold non-ASCII characters to their ASCII equivalents (for example, `é` to `e`)
***
### baseTokenizer?
```ts
optional baseTokenizer: "raw" | "simple" | "whitespace";
```
The tokenizer to use when building the index.
The default is "simple".
The following tokenizers are available:
"simple" - Simple tokenizer. This tokenizer splits the text into tokens using whitespace and punctuation as a delimiter.
"whitespace" - Whitespace tokenizer. This tokenizer splits the text into tokens using whitespace as a delimiter.
"raw" - Raw tokenizer. This tokenizer does not split the text into tokens and indexes the entire text as a single token.
***
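The three `baseTokenizer` values described above can be sketched as follows. This is only an approximation of the described splitting behavior, not the tokenizer code LanceDB runs; the function `tokenize` is hypothetical, and treating "punctuation" as ASCII punctuation is an assumption.

```ts
type BaseTokenizer = "raw" | "simple" | "whitespace";

// Hypothetical helper approximating the documented splitting rules.
function tokenize(text: string, tokenizer: BaseTokenizer): string[] {
  switch (tokenizer) {
    case "raw":
      // The entire text is indexed as a single token.
      return [text];
    case "whitespace":
      // Split on runs of whitespace only; punctuation stays attached.
      return text.split(/\s+/).filter((t) => t.length > 0);
    case "simple":
      // Split on whitespace and (assumed ASCII) punctuation.
      return text.split(/[\s!-\/:-@\[-`{-~]+/).filter((t) => t.length > 0);
  }
}

console.log(tokenize("Hello, world!", "simple"));     // [ 'Hello', 'world' ]
console.log(tokenize("Hello, world!", "whitespace")); // [ 'Hello,', 'world!' ]
console.log(tokenize("Hello, world!", "raw"));        // [ 'Hello, world!' ]
```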
### language?
```ts
optional language: string;
```
The language used for stemming and stop words.
This is only used when `stem` or `removeStopWords` is true.
***
### lowercase?
```ts
optional lowercase: boolean;
```
whether to lowercase tokens
***
### maxTokenLength?
```ts
optional maxTokenLength: number;
```
maximum token length
tokens longer than this length will be ignored
***
### removeStopWords?
```ts
optional removeStopWords: boolean;
```
whether to remove stop words
***
### stem?
```ts
optional stem: boolean;
```
whether to stem tokens
***
### withPosition?
```ts
optional withPosition: boolean;
```
Whether to build the index with positions.
True by default.
If set to false, the index will not store the positions of the tokens in the text,
which will make the index smaller and faster to build, but will not support phrase queries.

@@ -0,0 +1,25 @@
[**@lancedb/lancedb**](../README.md) • **Docs**
***
[@lancedb/lancedb](../globals.md) / FullTextQuery
# Interface: FullTextQuery
Represents a full-text query interface.
This interface defines the structure and behavior for full-text queries,
including methods to retrieve the query type and convert the query to a dictionary format.
## Methods
### queryType()
```ts
queryType(): FullTextQueryType
```
The type of the full-text query.
#### Returns
[`FullTextQueryType`](../enumerations/FullTextQueryType.md)

@@ -0,0 +1,22 @@
[**@lancedb/lancedb**](../README.md) • **Docs**
***
[@lancedb/lancedb](../globals.md) / FullTextSearchOptions
# Interface: FullTextSearchOptions
Options that control the behavior of a full text search
## Properties
### columns?
```ts
optional columns: string | string[];
```
The columns to search
If not specified, all indexed columns will be searched.
For now, only one column can be searched.

View File

@@ -0,0 +1,149 @@
[**@lancedb/lancedb**](../README.md) • **Docs**
***
[@lancedb/lancedb](../globals.md) / HnswPqOptions
# Interface: HnswPqOptions
Options to create an `HNSW_PQ` index
## Properties
### distanceType?
```ts
optional distanceType: "l2" | "cosine" | "dot";
```
The distance metric used to train the index.
Default value is "l2".
The following distance types are available:
"l2" - Euclidean distance. This is a very common distance metric that
accounts for both magnitude and direction when determining the distance
between vectors. l2 distance has a range of [0, ∞).
"cosine" - Cosine distance. Cosine distance is a distance metric
calculated from the cosine similarity between two vectors. Cosine
similarity is a measure of similarity between two non-zero vectors of an
inner product space. It is defined to equal the cosine of the angle
between them. Unlike l2, the cosine distance is not affected by the
magnitude of the vectors. Cosine distance has a range of [0, 2].
"dot" - Dot product. Dot distance is the dot product of two vectors. Dot
distance has a range of (-∞, ∞). If the vectors are normalized (i.e. their
l2 norm is 1), then dot distance is equivalent to the cosine distance.
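The three metrics can be compared with a small sketch using their textbook definitions. The function names are hypothetical, and whether LanceDB's internal implementation matches these formulas exactly (for example, squared versus unsquared l2, or how the dot product is turned into a distance) is an assumption not confirmed by the text above.

```ts
// Plain dot product of two equal-length vectors.
function dot(a: number[], b: number[]): number {
  return a.reduce((sum, x, i) => sum + x * b[i], 0);
}

// Euclidean (l2) distance; range [0, ∞).
function l2(a: number[], b: number[]): number {
  return Math.sqrt(a.reduce((sum, x, i) => sum + (x - b[i]) * (x - b[i]), 0));
}

// Cosine distance = 1 - cosine similarity; range [0, 2].
function cosineDistance(a: number[], b: number[]): number {
  return 1 - dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));
}

console.log(l2([0, 0], [3, 4]));              // 5
console.log(cosineDistance([1, 0], [0, 1]));  // 1 (orthogonal)
console.log(cosineDistance([1, 0], [-1, 0])); // 2 (opposite)
console.log(cosineDistance([2, 0], [5, 0]));  // 0 (magnitude ignored)
console.log(dot([1, 2], [3, 4]));             // 11
```

For unit-length vectors the dot product equals the cosine similarity, which is why the two metrics rank neighbors identically on normalized data.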
***
### efConstruction?
```ts
optional efConstruction: number;
```
The number of candidates to evaluate during the construction of the HNSW graph.
The default value is 300.
This value controls the tradeoff between build speed and accuracy.
The higher the value the more accurate the build but the slower it will be.
150 to 300 is the typical range. 100 is a minimum for good quality search
results. In most cases, there is no benefit to setting this higher than 500.
This value should be set to a value that is not less than `ef` in the search phase.
***
### m?
```ts
optional m: number;
```
The number of neighbors to select for each vector in the HNSW graph.
The default value is 20.
This value controls the tradeoff between search speed and accuracy.
The higher the value the more accurate the search but the slower it will be.
***
### maxIterations?
```ts
optional maxIterations: number;
```
Max iterations to train kmeans.
The default value is 50.
When training an IVF index we use kmeans to calculate the partitions. This parameter
controls how many iterations of kmeans to run.
Increasing this might improve the quality of the index but in most cases the parameter
is unused because kmeans will converge with fewer iterations. The parameter is only
used in cases where kmeans does not appear to converge. In those cases it is unlikely
that setting this larger will lead to the index converging anyway.
***
### numPartitions?
```ts
optional numPartitions: number;
```
The number of IVF partitions to create.
For HNSW, we recommend a small number of partitions. Setting this to 1 works
well for most tables. For very large tables, training just one HNSW graph
will require too much memory. Each partition becomes its own HNSW graph, so
setting this value higher reduces the peak memory use of training.
***
### numSubVectors?
```ts
optional numSubVectors: number;
```
Number of sub-vectors of PQ.
This value controls how much the vector is compressed during the quantization step.
The more sub vectors there are the less the vector is compressed. The default is
the dimension of the vector divided by 16. If the dimension is not evenly divisible
by 16 we use the dimension divided by 8.
The above two cases are highly preferred. Having 8 or 16 values per subvector allows
us to use efficient SIMD instructions.
If the dimension is not divisible by 8 then we use 1 subvector. This is not ideal and
will likely result in poor performance.
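The default rule described above can be written out as a short sketch. The function name `defaultNumSubVectors` is hypothetical; it only restates the rule in the text and is not the library's own code.

```ts
// Restates the documented default: dimension / 16 if divisible by 16,
// otherwise dimension / 8 if divisible by 8, otherwise a single subvector.
function defaultNumSubVectors(dimension: number): number {
  if (dimension % 16 === 0) {
    return dimension / 16;
  }
  if (dimension % 8 === 0) {
    return dimension / 8;
  }
  return 1;
}

console.log(defaultNumSubVectors(128)); // 8  (16 values per subvector)
console.log(defaultNumSubVectors(24));  // 3  (8 values per subvector)
console.log(defaultNumSubVectors(10));  // 1  (fallback, likely poor performance)
```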
***
### sampleRate?
```ts
optional sampleRate: number;
```
The rate used to calculate the number of training vectors for kmeans.
Default value is 256.
When an IVF index is trained, we need to calculate partitions. These are groups
of vectors that are similar to each other. To do this we use an algorithm called kmeans.
Running kmeans on a large dataset can be slow. To speed this up we run kmeans on a
random sample of the data. This parameter controls the size of the sample. The total
number of vectors used to train the index is `sample_rate * num_partitions`.
Increasing this value might improve the quality of the index but in most cases the
default should be sufficient.
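As a quick arithmetic check of the formula above, the sketch below computes the number of training vectors for a given configuration. The function name `trainingSampleSize` is hypothetical, and the partition counts in the usage lines are made-up inputs.

```ts
// Number of vectors sampled for kmeans training, per the formula
// sample_rate * num_partitions stated in the text.
function trainingSampleSize(sampleRate: number, numPartitions: number): number {
  return sampleRate * numPartitions;
}

// With the default sampleRate of 256:
console.log(trainingSampleSize(256, 1));   // 256
console.log(trainingSampleSize(256, 100)); // 25600
```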

@@ -0,0 +1,128 @@
[**@lancedb/lancedb**](../README.md) • **Docs**
***
[@lancedb/lancedb](../globals.md) / HnswSqOptions
# Interface: HnswSqOptions
Options to create an `HNSW_SQ` index
## Properties
### distanceType?
```ts
optional distanceType: "l2" | "cosine" | "dot";
```
The distance metric used to train the index.
Default value is "l2".
The following distance types are available:
"l2" - Euclidean distance. This is a very common distance metric that
accounts for both magnitude and direction when determining the distance
between vectors. l2 distance has a range of [0, ∞).
"cosine" - Cosine distance. Cosine distance is a distance metric
calculated from the cosine similarity between two vectors. Cosine
similarity is a measure of similarity between two non-zero vectors of an
inner product space. It is defined to equal the cosine of the angle
between them. Unlike l2, the cosine distance is not affected by the
magnitude of the vectors. Cosine distance has a range of [0, 2].
"dot" - Dot product. Dot distance is the dot product of two vectors. Dot
distance has a range of (-∞, ∞). If the vectors are normalized (i.e. their
l2 norm is 1), then dot distance is equivalent to the cosine distance.
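The three metrics above can be written out using their textbook definitions. This is a sketch for intuition only; the function names are hypothetical and the library's internal computation (for example, whether it reports a squared l2 value) may differ.

```ts
// Textbook definitions of the metrics described above (illustrative only).
function dot(a: number[], b: number[]): number {
  return a.reduce((sum, x, i) => sum + x * b[i], 0);
}

function l2(a: number[], b: number[]): number {
  return Math.sqrt(a.reduce((sum, x, i) => sum + (x - b[i]) ** 2, 0));
}

// 1 - cosine similarity, so the range is [0, 2].
// Yields NaN when either vector is all zeros (the distance is undefined).
function cosineDistance(a: number[], b: number[]): number {
  const norms = Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b));
  return 1 - dot(a, b) / norms;
}

console.log(l2([0, 0], [3, 4])); // 5
console.log(cosineDistance([1, 0], [-1, 0])); // 2
```

For vectors with an l2 norm of 1, the dot product equals the cosine similarity, so `cosineDistance(a, b)` and `1 - dot(a, b)` agree up to floating-point error.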
***
### efConstruction?
```ts
optional efConstruction: number;
```
The number of candidates to evaluate during the construction of the HNSW graph.
The default value is 300.
This value controls the tradeoff between build speed and accuracy.
The higher the value the more accurate the build but the slower it will be.
150 to 300 is the typical range. 100 is a minimum for good quality search
results. In most cases, there is no benefit to setting this higher than 500.
This value should be set to a value that is not less than `ef` in the search phase.
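The last recommendation can be checked before building an index. The sketch below uses a local stand-in type rather than the library's interface, and `efConstructionCoversSearch` is a hypothetical helper; only the default of 300 comes from the text above.

```ts
// Local stand-in for the documented option names; not imported from the library.
interface HnswBuildParams {
  m?: number;
  efConstruction?: number;
}

// Hypothetical check: efConstruction (default 300 per the text above)
// should not be less than the `ef` used at search time.
function efConstructionCoversSearch(params: HnswBuildParams, searchEf: number): boolean {
  const efConstruction = params.efConstruction ?? 300;
  return efConstruction >= searchEf;
}

console.log(efConstructionCoversSearch({ efConstruction: 100 }, 150)); // false
```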
***
### m?
```ts
optional m: number;
```
The number of neighbors to select for each vector in the HNSW graph.
The default value is 20.
This value controls the tradeoff between search speed and accuracy.
The higher the value the more accurate the search but the slower it will be.
***
### maxIterations?
```ts
optional maxIterations: number;
```
Max iterations to train kmeans.
The default value is 50.
When training an IVF index we use kmeans to calculate the partitions. This parameter
controls how many iterations of kmeans to run.
Increasing this might improve the quality of the index but in most cases the parameter
is unused because kmeans will converge with fewer iterations. The parameter is only
used in cases where kmeans does not appear to converge. In those cases it is unlikely
that setting this larger will lead to the index converging anyway.
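To show why the iteration cap is rarely reached, here is a toy one-dimensional kmeans loop that stops as soon as the centroids stop moving. It is a simplified sketch, not the training code used by the library; `kmeans1d` and its exact-equality convergence test are assumptions made for illustration.

```ts
// Toy 1-D kmeans: stops early on convergence, otherwise after maxIterations.
function kmeans1d(
  points: number[],
  initialCentroids: number[],
  maxIterations: number,
): { centroids: number[]; iterations: number } {
  let centroids = [...initialCentroids];
  let iterations = 0;
  while (iterations < maxIterations) {
    iterations++;
    const sums = centroids.map(() => 0);
    const counts = centroids.map(() => 0);
    // Assign each point to its nearest centroid.
    for (const p of points) {
      let best = 0;
      for (let c = 1; c < centroids.length; c++) {
        if (Math.abs(p - centroids[c]) < Math.abs(p - centroids[best])) best = c;
      }
      sums[best] += p;
      counts[best]++;
    }
    // Move each centroid to the mean of its points (empty clusters stay put).
    const next = centroids.map((c, i) => (counts[i] > 0 ? sums[i] / counts[i] : c));
    const converged = next.every((c, i) => c === centroids[i]);
    centroids = next;
    if (converged) break;
  }
  return { centroids, iterations };
}

// Converges after 2 iterations even though up to 50 are allowed.
console.log(kmeans1d([1, 2, 9, 10], [0, 5], 50));
```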
***
### numPartitions?
```ts
optional numPartitions: number;
```
The number of IVF partitions to create.
For HNSW, we recommend a small number of partitions. Setting this to 1 works
well for most tables. For very large tables, training just one HNSW graph
will require too much memory. Each partition becomes its own HNSW graph, so
setting this value higher reduces the peak memory use of training.
***
### sampleRate?
```ts
optional sampleRate: number;
```
The rate used to calculate the number of training vectors for kmeans.
Default value is 256.
When an IVF index is trained, we need to calculate partitions. These are groups
of vectors that are similar to each other. To do this we use an algorithm called kmeans.
Running kmeans on a large dataset can be slow. To speed this up we run kmeans on a
random sample of the data. This parameter controls the size of the sample. The total
number of vectors used to train the index is `sample_rate * num_partitions`.
Increasing this value might improve the quality of the index but in most cases the
default should be sufficient.

@@ -39,3 +39,11 @@ and the same name, then an error will be returned. This is true even if
that index is out of date.
The default is true
***
### waitTimeoutSeconds?
```ts
optional waitTimeoutSeconds: number;
```

@@ -30,6 +30,17 @@ The type of the index
***
### loss?
```ts
optional loss: number;
```
The KMeans loss value of the index.
It is only present for vector indices.
***
### numIndexedRows
```ts ```ts

@@ -0,0 +1,112 @@
[**@lancedb/lancedb**](../README.md) • **Docs**
***
[@lancedb/lancedb](../globals.md) / IvfFlatOptions
# Interface: IvfFlatOptions
Options to create an `IVF_FLAT` index
## Properties
### distanceType?
```ts
optional distanceType: "l2" | "cosine" | "dot" | "hamming";
```
Distance type to use to build the index.
Default value is "l2".
This is used when training the index to calculate the IVF partitions
(vectors are grouped in partitions with similar vectors according to this
distance type).
The distance type used to train an index MUST match the distance type used
to search the index. Failure to do so will yield inaccurate results.
The following distance types are available:
"l2" - Euclidean distance. This is a very common distance metric that
accounts for both magnitude and direction when determining the distance
between vectors. l2 distance has a range of [0, ∞).
"cosine" - Cosine distance. Cosine distance is a distance metric
calculated from the cosine similarity between two vectors. Cosine
similarity is a measure of similarity between two non-zero vectors of an
inner product space. It is defined to equal the cosine of the angle
between them. Unlike l2, the cosine distance is not affected by the
magnitude of the vectors. Cosine distance has a range of [0, 2].
Note: the cosine distance is undefined when one (or both) of the vectors
are all zeros (there is no direction). These vectors are invalid and may
never be returned from a vector search.
"dot" - Dot product. Dot distance is the dot product of two vectors. Dot
distance has a range of (-∞, ∞). If the vectors are normalized (i.e. their
l2 norm is 1), then dot distance is equivalent to the cosine distance.
"hamming" - Hamming distance. Hamming distance is a distance metric
calculated from the number of bits that are different between two vectors.
Hamming distance has a range of [0, dimension]. Note that the hamming distance
is only valid for binary vectors.
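Hamming distance over binary vectors amounts to counting differing bits. The sketch below assumes the binary vectors are packed eight bits per byte in a `Uint8Array`; that packing and the function name are assumptions for illustration, not something the text above specifies.

```ts
// Counts the bits that differ between two bit-packed binary vectors.
function hammingDistance(a: Uint8Array, b: Uint8Array): number {
  if (a.length !== b.length) {
    throw new Error("vectors must have the same length");
  }
  let distance = 0;
  for (let i = 0; i < a.length; i++) {
    let diff = a[i] ^ b[i]; // bits set where the two bytes differ
    while (diff !== 0) {
      distance += diff & 1;
      diff >>= 1;
    }
  }
  return distance;
}

// 0xb0 = 10110000, 0x98 = 10011000: two bits differ.
console.log(hammingDistance(new Uint8Array([0xb0]), new Uint8Array([0x98]))); // 2
```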
***
### maxIterations?
```ts
optional maxIterations: number;
```
Max iterations to train IVF kmeans.
When training an IVF FLAT index we use kmeans to calculate the partitions. This parameter
controls how many iterations of kmeans to run.
Increasing this might improve the quality of the index but in most cases these extra
iterations have diminishing returns.
The default value is 50.
***
### numPartitions?
```ts
optional numPartitions: number;
```
The number of IVF partitions to create.
This value should generally scale with the number of rows in the dataset.
By default the number of partitions is the square root of the number of
rows.
If this value is too large then the first part of the search (picking the
right partition) will be slow. If this value is too small then the second
part of the search (searching within a partition) will be slow.
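The square-root default can be approximated as follows. The text only says the default is the square root of the row count, so the rounding and the lower bound of 1 in this sketch are assumptions, and `defaultNumPartitions` is a hypothetical name.

```ts
// Hypothetical helper: square root of the row count.
// Rounding and the minimum of 1 are assumptions for illustration.
function defaultNumPartitions(numRows: number): number {
  return Math.max(1, Math.round(Math.sqrt(numRows)));
}

console.log(defaultNumPartitions(1000000)); // 1000
```

With one million rows this gives 1000 partitions of roughly 1000 rows each, which balances the cost of picking a partition against the cost of searching inside it.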
***
### sampleRate?
```ts
optional sampleRate: number;
```
The number of vectors, per partition, to sample when training IVF kmeans.
When an IVF FLAT index is trained, we need to calculate partitions. These are groups
of vectors that are similar to each other. To do this we use an algorithm called kmeans.
Running kmeans on a large dataset can be slow. To speed this up we run kmeans on a
random sample of the data. This parameter controls the size of the sample. The total
number of vectors used to train the index is `sample_rate * num_partitions`.
Increasing this value might improve the quality of the index but in most cases the
default should be sufficient.
The default value is 256.

@@ -31,13 +31,13 @@ The following distance types are available:
"l2" - Euclidean distance. This is a very common distance metric that
accounts for both magnitude and direction when determining the distance
between vectors. l2 distance has a range of [0, ∞).
"cosine" - Cosine distance. Cosine distance is a distance metric
calculated from the cosine similarity between two vectors. Cosine
similarity is a measure of similarity between two non-zero vectors of an
inner product space. It is defined to equal the cosine of the angle
between them. Unlike l2, the cosine distance is not affected by the
magnitude of the vectors. Cosine distance has a range of [0, 2].
Note: the cosine distance is undefined when one (or both) of the vectors
@@ -46,7 +46,7 @@ never be returned from a vector search.
"dot" - Dot product. Dot distance is the dot product of two vectors. Dot
distance has a range of (-∞, ∞). If the vectors are normalized (i.e. their
l2 norm is 1), then dot distance is equivalent to the cosine distance.
***
@@ -68,6 +68,21 @@ The default value is 50.
***
### numBits?
```ts
optional numBits: number;
```
Number of bits per sub-vector.
This value controls how much each subvector is compressed. The more bits, the more
accurate the index will be, but the slower the search. The default is 8 bits.
The number of bits must be 4 or 8.
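The effect of `numBits` on compression can be estimated with a little arithmetic: each subvector is stored as a code of `numBits` bits, which can address `2 ** numBits` centroids. The helpers below are hypothetical and describe product quantization in general; how the library lays out the codes on disk is not covered by the text above.

```ts
// Hypothetical helpers for estimating product quantization code size.
function centroidsPerSubVector(numBits: number): number {
  return 2 ** numBits;
}

function pqCodeSizeBytes(numSubVectors: number, numBits: number): number {
  if (numBits !== 4 && numBits !== 8) {
    throw new Error("numBits must be 4 or 8");
  }
  return Math.ceil((numSubVectors * numBits) / 8);
}

console.log(centroidsPerSubVector(8)); // 256
console.log(pqCodeSizeBytes(16, 8)); // 16
console.log(pqCodeSizeBytes(16, 4)); // 8
```

Under these assumptions, a 256-dimensional float32 vector (1024 bytes) with 16 subvectors shrinks to a 16-byte code at 8 bits, or an 8-byte code at 4 bits.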
***
### numPartitions?
```ts ```ts

@@ -0,0 +1,31 @@
[**@lancedb/lancedb**](../README.md) • **Docs**
***
[@lancedb/lancedb](../globals.md) / MergeStats
# Interface: MergeStats
## Properties
### numDeletedRows
```ts
numDeletedRows: bigint;
```
***
### numInsertedRows
```ts
numInsertedRows: bigint;
```
***
### numUpdatedRows
```ts
numUpdatedRows: bigint;
```

@@ -0,0 +1,40 @@
[**@lancedb/lancedb**](../README.md) • **Docs**
***
[@lancedb/lancedb](../globals.md) / OpenTableOptions
# Interface: OpenTableOptions
## Properties
### indexCacheSize?
```ts
optional indexCacheSize: number;
```
Set the size of the index cache, specified as a number of entries.
The exact meaning of an "entry" depends on the type of index:
- IVF: there is one entry for each IVF partition
- BTREE: there is one entry for the entire index
This cache applies to the entire opened table, across all indices.
Setting this value higher will increase performance on larger datasets
at the expense of more RAM.
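One way to reason about a suitable cache size is to count how many entries all indices on a table would need, using the per-index rules listed above. The types and function below are hypothetical stand-ins for illustration and are not part of the library.

```ts
// Hypothetical description of a table's indices, for estimation only.
type IndexSketch =
  | { kind: "ivf"; numPartitions: number }
  | { kind: "btree" };

// Entries needed to hold every index fully in the cache:
// one per IVF partition, one per BTREE index.
function cacheEntriesToHoldAll(indices: IndexSketch[]): number {
  return indices.reduce(
    (total, index) => total + (index.kind === "ivf" ? index.numPartitions : 1),
    0,
  );
}

console.log(cacheEntriesToHoldAll([{ kind: "ivf", numPartitions: 256 }, { kind: "btree" }])); // 257
```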
***
### storageOptions?
```ts
optional storageOptions: Record<string, string>;
```
Configuration for object storage.
Options already set on the connection will be inherited by the table,
but can be overridden here.
The available options are described at https://lancedb.github.io/lancedb/guides/storage/
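The inheritance behaviour described above resembles a shallow merge in which table-level values win. The sketch below illustrates that idea only; `effectiveStorageOptions` is a hypothetical helper, and the keys `region` and `timeout` are placeholder examples rather than a statement of which keys the library accepts.

```ts
// Hypothetical illustration: table options override connection options.
function effectiveStorageOptions(
  connectionOptions: Record<string, string>,
  tableOptions: Record<string, string> = {},
): Record<string, string> {
  return { ...connectionOptions, ...tableOptions };
}

const merged = effectiveStorageOptions(
  { region: "us-east-1", timeout: "30s" }, // placeholder keys
  { timeout: "60s" },
);
console.log(merged); // { region: "us-east-1", timeout: "60s" }
```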

Some files were not shown because too many files have changed in this diff.