Compare commits

..

15 Commits

Author SHA1 Message Date
Lance Release
a1a0472f3f Bump version: 0.22.1-beta.0 → 0.22.1-beta.1 2025-04-29 22:18:53 +00:00
Wyatt Alt
3425a6d339 feat: upgrade lance to v0.27.0-beta.2 (#2364)
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **Chores**
- Updated dependencies for related components to use the latest version
from a specific repository source. No changes to features or public
functionality.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
2025-04-29 14:59:56 -07:00
Ryan Green
af54e0ce06 feat: add table stats API (#2363)
* Add a new "table stats" API to expose basic table and fragment
statistics with local and remote table implementations

### Questions
* This is using `calculate_data_stats` to determine total bytes in the
table. This seems like a potentially expensive operation - are there any
concerns about performance for large datasets?

### Notes
* bytes_on_disk seems to be stored at the column level but there does
not seem to be a way to easily calculate total bytes per fragment. This
may need to be added in lance before we can support fragment size
(bytes) statistics.

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **New Features**
- Added a method to retrieve comprehensive table statistics, including
total rows, index counts, storage size, and detailed fragment size
metrics such as minimum, maximum, mean, and percentiles.
- Enabled fetching of table statistics from remote sources through
asynchronous requests.
- Extended table interfaces across Python, Rust, and Node.js to support
synchronous and asynchronous retrieval of table statistics.
- **Tests**
- Introduced tests to verify the accuracy of the new table statistics
feature for both populated and empty tables.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
2025-04-29 15:19:08 -02:30
Lance Release
089905fe8f Updating package-lock.json 2025-04-28 19:13:36 +00:00
Lance Release
554939e5d2 Updating package-lock.json 2025-04-28 17:20:58 +00:00
Lance Release
7a13814922 Updating package-lock.json 2025-04-28 17:20:42 +00:00
Lance Release
e9f25f6a12 Bump version: 0.19.0 → 0.19.1-beta.0 2025-04-28 17:20:26 +00:00
Lance Release
419a433244 Bump version: 0.22.0 → 0.22.1-beta.0 2025-04-28 17:20:10 +00:00
LuQQiu
a9311c4dc0 feat: add list/create/delete/update/checkout tag API (#2353)
add the tag related API to list existing tags, attach tag to a version,
update the tag version, delete tag, get the version of the tag, and
checkout the version that the tag bounded to.

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
- Introduced table version tagging, allowing users to create, update,
delete, and list human-readable tags for specific table versions.
  - Enabled checking out a table by either version number or tag name.
- Added new interfaces for tag management in both Python and Node.js
APIs, supporting synchronous and asynchronous workflows.

- **Bug Fixes**
  - None.

- **Documentation**
- Updated documentation to describe the new tagging features, including
usage examples.

- **Tests**
- Added comprehensive tests for tag creation, updating, deletion,
listing, and version checkout by tag in both Python and Node.js
environments.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
2025-04-28 10:04:46 -07:00
LuQQiu
178bcf9c90 fix: hybrid search explain plan analyze plan (#2360)
Fix hybrid search explain plan analyze plan API

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **New Features**
- Added options to view the execution plan and analyze the runtime
performance of hybrid queries.
- **Refactor**
- Improved internal handling of query setup for better modularity and
maintainability.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
2025-04-27 18:39:43 -07:00
Lance Release
b9be092cb1 Updating package-lock.json 2025-04-25 22:05:57 +00:00
Lance Release
e8c0c52315 Updating package-lock.json 2025-04-25 21:17:03 +00:00
Lance Release
a60fa0d3b7 Updating package-lock.json 2025-04-25 21:16:48 +00:00
Lance Release
726d629b9b Bump version: 0.19.0-beta.12 → 0.19.0 2025-04-25 21:16:30 +00:00
Lance Release
b493f56dee Bump version: 0.19.0-beta.11 → 0.19.0-beta.12 2025-04-25 21:16:25 +00:00
43 changed files with 2182 additions and 310 deletions

View File

@@ -1,5 +1,5 @@
[tool.bumpversion]
current_version = "0.19.0-beta.11"
current_version = "0.19.1-beta.0"
parse = """(?x)
(?P<major>0|[1-9]\\d*)\\.
(?P<minor>0|[1-9]\\d*)\\.

330
Cargo.lock generated

File diff suppressed because it is too large Load Diff

View File

@@ -21,14 +21,14 @@ categories = ["database-implementations"]
rust-version = "1.78.0"
[workspace.dependencies]
lance = { "version" = "=0.26.0", "features" = ["dynamodb"] }
lance-io = "=0.26.0"
lance-index = "=0.26.0"
lance-linalg = "=0.26.0"
lance-table = "=0.26.0"
lance-testing = "=0.26.0"
lance-datafusion = "=0.26.0"
lance-encoding = "=0.26.0"
lance = { "version" = "=0.27.0", "features" = ["dynamodb"], tag = "v0.27.0-beta.2", git="https://github.com/lancedb/lance.git" }
lance-io = { version = "=0.27.0", tag = "v0.27.0-beta.2", git="https://github.com/lancedb/lance.git" }
lance-index = { version = "=0.27.0", tag = "v0.27.0-beta.2", git="https://github.com/lancedb/lance.git" }
lance-linalg = { version = "=0.27.0", tag = "v0.27.0-beta.2", git="https://github.com/lancedb/lance.git" }
lance-table = { version = "=0.27.0", tag = "v0.27.0-beta.2", git="https://github.com/lancedb/lance.git" }
lance-testing = { version = "=0.27.0", tag = "v0.27.0-beta.2", git="https://github.com/lancedb/lance.git" }
lance-datafusion = { version = "=0.27.0", tag = "v0.27.0-beta.2", git="https://github.com/lancedb/lance.git" }
lance-encoding = { version = "=0.27.0", tag = "v0.27.0-beta.2", git="https://github.com/lancedb/lance.git" }
# Note that this one does not include pyarrow
arrow = { version = "54.1", optional = false }
arrow-array = "54.1"

View File

@@ -117,8 +117,8 @@ wish to return to standard mode, call `checkoutLatest`.
#### Parameters
* **version**: `number`
The version to checkout
* **version**: `string` \| `number`
The version to checkout, could be version number or tag
#### Returns
@@ -615,6 +615,50 @@ of the given query
***
### stats()
```ts
abstract stats(): Promise<TableStatistics>
```
Returns table and fragment statistics
#### Returns
`Promise`&lt;[`TableStatistics`](../interfaces/TableStatistics.md)&gt;
The table and fragment statistics
***
### tags()
```ts
abstract tags(): Promise<Tags>
```
Get a tags manager for this table.
Tags allow you to label specific versions of a table with a human-readable name.
The returned tags manager can be used to list, create, update, or delete tags.
#### Returns
`Promise`&lt;[`Tags`](Tags.md)&gt;
A tags manager for this table
#### Example
```typescript
const tagsManager = await table.tags();
await tagsManager.create("v1", 1);
const tags = await tagsManager.list();
console.log(tags); // { "v1": { version: 1, manifestSize: ... } }
```
***
### toArrow()
```ts

View File

@@ -0,0 +1,35 @@
[**@lancedb/lancedb**](../README.md) • **Docs**
***
[@lancedb/lancedb](../globals.md) / TagContents
# Class: TagContents
## Constructors
### new TagContents()
```ts
new TagContents(): TagContents
```
#### Returns
[`TagContents`](TagContents.md)
## Properties
### manifestSize
```ts
manifestSize: number;
```
***
### version
```ts
version: number;
```

View File

@@ -0,0 +1,99 @@
[**@lancedb/lancedb**](../README.md) • **Docs**
***
[@lancedb/lancedb](../globals.md) / Tags
# Class: Tags
## Constructors
### new Tags()
```ts
new Tags(): Tags
```
#### Returns
[`Tags`](Tags.md)
## Methods
### create()
```ts
create(tag, version): Promise<void>
```
#### Parameters
* **tag**: `string`
* **version**: `number`
#### Returns
`Promise`&lt;`void`&gt;
***
### delete()
```ts
delete(tag): Promise<void>
```
#### Parameters
* **tag**: `string`
#### Returns
`Promise`&lt;`void`&gt;
***
### getVersion()
```ts
getVersion(tag): Promise<number>
```
#### Parameters
* **tag**: `string`
#### Returns
`Promise`&lt;`number`&gt;
***
### list()
```ts
list(): Promise<Record<string, TagContents>>
```
#### Returns
`Promise`&lt;`Record`&lt;`string`, [`TagContents`](TagContents.md)&gt;&gt;
***
### update()
```ts
update(tag, version): Promise<void>
```
#### Parameters
* **tag**: `string`
* **version**: `number`
#### Returns
`Promise`&lt;`void`&gt;

View File

@@ -27,6 +27,8 @@
- [QueryBase](classes/QueryBase.md)
- [RecordBatchIterator](classes/RecordBatchIterator.md)
- [Table](classes/Table.md)
- [TagContents](classes/TagContents.md)
- [Tags](classes/Tags.md)
- [VectorColumnOptions](classes/VectorColumnOptions.md)
- [VectorQuery](classes/VectorQuery.md)
@@ -40,6 +42,8 @@
- [ConnectionOptions](interfaces/ConnectionOptions.md)
- [CreateTableOptions](interfaces/CreateTableOptions.md)
- [ExecutableQuery](interfaces/ExecutableQuery.md)
- [FragmentStatistics](interfaces/FragmentStatistics.md)
- [FragmentSummaryStats](interfaces/FragmentSummaryStats.md)
- [FtsOptions](interfaces/FtsOptions.md)
- [FullTextQuery](interfaces/FullTextQuery.md)
- [FullTextSearchOptions](interfaces/FullTextSearchOptions.md)
@@ -57,6 +61,7 @@
- [RemovalStats](interfaces/RemovalStats.md)
- [RetryConfig](interfaces/RetryConfig.md)
- [TableNamesOptions](interfaces/TableNamesOptions.md)
- [TableStatistics](interfaces/TableStatistics.md)
- [TimeoutConfig](interfaces/TimeoutConfig.md)
- [UpdateOptions](interfaces/UpdateOptions.md)
- [Version](interfaces/Version.md)

View File

@@ -0,0 +1,37 @@
[**@lancedb/lancedb**](../README.md) • **Docs**
***
[@lancedb/lancedb](../globals.md) / FragmentStatistics
# Interface: FragmentStatistics
## Properties
### lengths
```ts
lengths: FragmentSummaryStats;
```
Statistics on the number of rows in the table fragments
***
### numFragments
```ts
numFragments: number;
```
The number of fragments in the table
***
### numSmallFragments
```ts
numSmallFragments: number;
```
The number of uncompacted fragments in the table

View File

@@ -0,0 +1,77 @@
[**@lancedb/lancedb**](../README.md) • **Docs**
***
[@lancedb/lancedb](../globals.md) / FragmentSummaryStats
# Interface: FragmentSummaryStats
## Properties
### max
```ts
max: number;
```
The number of rows in the fragment with the most rows
***
### mean
```ts
mean: number;
```
The mean number of rows in the fragments
***
### min
```ts
min: number;
```
The number of rows in the fragment with the fewest rows
***
### p25
```ts
p25: number;
```
The 25th percentile of number of rows in the fragments
***
### p50
```ts
p50: number;
```
The 50th percentile of number of rows in the fragments
***
### p75
```ts
p75: number;
```
The 75th percentile of number of rows in the fragments
***
### p99
```ts
p99: number;
```
The 99th percentile of number of rows in the fragments

View File

@@ -0,0 +1,47 @@
[**@lancedb/lancedb**](../README.md) • **Docs**
***
[@lancedb/lancedb](../globals.md) / TableStatistics
# Interface: TableStatistics
## Properties
### fragmentStats
```ts
fragmentStats: FragmentStatistics;
```
Statistics on table fragments
***
### numIndices
```ts
numIndices: number;
```
The number of indices in the table
***
### numRows
```ts
numRows: number;
```
The number of rows in the table
***
### totalBytes
```ts
totalBytes: number;
```
The total number of bytes in the table

View File

@@ -8,7 +8,7 @@
<parent>
<groupId>com.lancedb</groupId>
<artifactId>lancedb-parent</artifactId>
<version>0.19.0-beta.11</version>
<version>0.19.1-beta.0</version>
<relativePath>../pom.xml</relativePath>
</parent>

View File

@@ -6,7 +6,7 @@
<groupId>com.lancedb</groupId>
<artifactId>lancedb-parent</artifactId>
<version>0.19.0-beta.11</version>
<version>0.19.1-beta.0</version>
<packaging>pom</packaging>
<name>LanceDB Parent</name>

44
node/package-lock.json generated
View File

@@ -1,12 +1,12 @@
{
"name": "vectordb",
"version": "0.19.0-beta.11",
"version": "0.19.1-beta.0",
"lockfileVersion": 3,
"requires": true,
"packages": {
"": {
"name": "vectordb",
"version": "0.19.0-beta.11",
"version": "0.19.1-beta.0",
"cpu": [
"x64",
"arm64"
@@ -52,11 +52,11 @@
"uuid": "^9.0.0"
},
"optionalDependencies": {
"@lancedb/vectordb-darwin-arm64": "0.19.0-beta.11",
"@lancedb/vectordb-darwin-x64": "0.19.0-beta.11",
"@lancedb/vectordb-linux-arm64-gnu": "0.19.0-beta.11",
"@lancedb/vectordb-linux-x64-gnu": "0.19.0-beta.11",
"@lancedb/vectordb-win32-x64-msvc": "0.19.0-beta.11"
"@lancedb/vectordb-darwin-arm64": "0.19.1-beta.0",
"@lancedb/vectordb-darwin-x64": "0.19.1-beta.0",
"@lancedb/vectordb-linux-arm64-gnu": "0.19.1-beta.0",
"@lancedb/vectordb-linux-x64-gnu": "0.19.1-beta.0",
"@lancedb/vectordb-win32-x64-msvc": "0.19.1-beta.0"
},
"peerDependencies": {
"@apache-arrow/ts": "^14.0.2",
@@ -327,9 +327,9 @@
}
},
"node_modules/@lancedb/vectordb-darwin-arm64": {
"version": "0.19.0-beta.11",
"resolved": "https://registry.npmjs.org/@lancedb/vectordb-darwin-arm64/-/vectordb-darwin-arm64-0.19.0-beta.11.tgz",
"integrity": "sha512-fLefGJYdlIRIjrJj8MU1r5Zix5LpKktpCYilA7tZrfvBWkubGceJ+U6RPsWz7VGBfWcETo3g5CBooUPhbtSMlQ==",
"version": "0.19.1-beta.0",
"resolved": "https://registry.npmjs.org/@lancedb/vectordb-darwin-arm64/-/vectordb-darwin-arm64-0.19.1-beta.0.tgz",
"integrity": "sha512-LXdyFLniwciIslw487JRmYB+PpuwBhq4jT9hNxKKMrnXfBvUEQ8HN9kxkLFy+pwFaY1wmFH0EfujnXlt8qXgqA==",
"cpu": [
"arm64"
],
@@ -340,9 +340,9 @@
]
},
"node_modules/@lancedb/vectordb-darwin-x64": {
"version": "0.19.0-beta.11",
"resolved": "https://registry.npmjs.org/@lancedb/vectordb-darwin-x64/-/vectordb-darwin-x64-0.19.0-beta.11.tgz",
"integrity": "sha512-FkCa1TbPLDXAGhlRI4tafcltzApCsyvgi+I+kX07u5DKPNQVALpQ3R6X6GLlbiFsAFBdyv9t2fqQ9DlgjJIZpA==",
"version": "0.19.1-beta.0",
"resolved": "https://registry.npmjs.org/@lancedb/vectordb-darwin-x64/-/vectordb-darwin-x64-0.19.1-beta.0.tgz",
"integrity": "sha512-kVZzc6j778vg3AftFdu8pSg/3J9YHCID70bf3QpMnl43b9ZSl4Sh8WNVYWiqfwOAElD+kOaTq9MC4f7jkW/DEQ==",
"cpu": [
"x64"
],
@@ -353,9 +353,9 @@
]
},
"node_modules/@lancedb/vectordb-linux-arm64-gnu": {
"version": "0.19.0-beta.11",
"resolved": "https://registry.npmjs.org/@lancedb/vectordb-linux-arm64-gnu/-/vectordb-linux-arm64-gnu-0.19.0-beta.11.tgz",
"integrity": "sha512-iZkL/01HNUNQ8pGK0+hoNyrM7P1YtShsyIQVzJMfo41SAofCBf9qvi9YaYyd49sDb+dQXeRn1+cfaJ9siz1OHw==",
"version": "0.19.1-beta.0",
"resolved": "https://registry.npmjs.org/@lancedb/vectordb-linux-arm64-gnu/-/vectordb-linux-arm64-gnu-0.19.1-beta.0.tgz",
"integrity": "sha512-EOrAwqUkCXBp7GVwzhyUoFIUbV+7eXDbEHo6mHgZN6E1SOAgBuPiwRqDlqA4uCYU8YhZJZabXdiLTP+EuSjpgw==",
"cpu": [
"arm64"
],
@@ -366,9 +366,9 @@
]
},
"node_modules/@lancedb/vectordb-linux-x64-gnu": {
"version": "0.19.0-beta.11",
"resolved": "https://registry.npmjs.org/@lancedb/vectordb-linux-x64-gnu/-/vectordb-linux-x64-gnu-0.19.0-beta.11.tgz",
"integrity": "sha512-MdKRHxe2tRQqmExNLv3f6Wvx1mEi98eFtD0ysm4tNrQdaS1MJbTp+DUehrRKkfDWsooalHkIi9d02BVw5qseUQ==",
"version": "0.19.1-beta.0",
"resolved": "https://registry.npmjs.org/@lancedb/vectordb-linux-x64-gnu/-/vectordb-linux-x64-gnu-0.19.1-beta.0.tgz",
"integrity": "sha512-Xs7wlM6LJS4/jps9Vi1TMtR9kYedzShZQ1sepJzn2cuI1jnbCrlLsFSMUfjhxy7/0+E8QO1f7cBO6gwOTunrog==",
"cpu": [
"x64"
],
@@ -379,9 +379,9 @@
]
},
"node_modules/@lancedb/vectordb-win32-x64-msvc": {
"version": "0.19.0-beta.11",
"resolved": "https://registry.npmjs.org/@lancedb/vectordb-win32-x64-msvc/-/vectordb-win32-x64-msvc-0.19.0-beta.11.tgz",
"integrity": "sha512-KWy+t9jr0feJAW9NkmM/w9kfdpp78+7mkeh9lb0g3xI3OdYU1yizNqFjbIQqJf7/L4sou4wmOjAC+FcP8qCtzg==",
"version": "0.19.1-beta.0",
"resolved": "https://registry.npmjs.org/@lancedb/vectordb-win32-x64-msvc/-/vectordb-win32-x64-msvc-0.19.1-beta.0.tgz",
"integrity": "sha512-tS9g2S4hxdsQfVd6S6N+HuX1Mt4iJEQjpHmmfz+WEfdc5JP6eyKoqqQzFw2oL3TJAG7HXfD88/24nCUUCHDKoA==",
"cpu": [
"x64"
],

View File

@@ -1,6 +1,6 @@
{
"name": "vectordb",
"version": "0.19.0-beta.11",
"version": "0.19.1-beta.0",
"description": " Serverless, low-latency vector database for AI applications",
"private": false,
"main": "dist/index.js",
@@ -89,10 +89,10 @@
}
},
"optionalDependencies": {
"@lancedb/vectordb-darwin-x64": "0.19.0-beta.11",
"@lancedb/vectordb-darwin-arm64": "0.19.0-beta.11",
"@lancedb/vectordb-linux-x64-gnu": "0.19.0-beta.11",
"@lancedb/vectordb-linux-arm64-gnu": "0.19.0-beta.11",
"@lancedb/vectordb-win32-x64-msvc": "0.19.0-beta.11"
"@lancedb/vectordb-darwin-x64": "0.19.1-beta.0",
"@lancedb/vectordb-darwin-arm64": "0.19.1-beta.0",
"@lancedb/vectordb-linux-x64-gnu": "0.19.1-beta.0",
"@lancedb/vectordb-linux-arm64-gnu": "0.19.1-beta.0",
"@lancedb/vectordb-win32-x64-msvc": "0.19.1-beta.0"
}
}

View File

@@ -1,7 +1,7 @@
[package]
name = "lancedb-nodejs"
edition.workspace = true
version = "0.19.0-beta.11"
version = "0.19.1-beta.0"
license.workspace = true
description.workspace = true
repository.workspace = true

View File

@@ -71,6 +71,29 @@ describe.each([arrow15, arrow16, arrow17, arrow18])(
await expect(table.countRows()).resolves.toBe(3);
});
it("should show table stats", async () => {
await table.add([{ id: 1 }, { id: 2 }]);
await table.add([{ id: 1 }]);
await expect(table.stats()).resolves.toEqual({
fragmentStats: {
lengths: {
max: 2,
mean: 1,
min: 1,
p25: 1,
p50: 2,
p75: 2,
p99: 2,
},
numFragments: 2,
numSmallFragments: 2,
},
numIndices: 0,
numRows: 3,
totalBytes: 24,
});
});
it("should overwrite data if asked", async () => {
await table.add([{ id: 1 }, { id: 2 }]);
await table.add([{ id: 1 }], { mode: "overwrite" });
@@ -1178,6 +1201,73 @@ describe("when dealing with versioning", () => {
});
});
describe("when dealing with tags", () => {
let tmpDir: tmp.DirResult;
beforeEach(() => {
tmpDir = tmp.dirSync({ unsafeCleanup: true });
});
afterEach(() => {
tmpDir.removeCallback();
});
it("can manage tags", async () => {
const conn = await connect(tmpDir.name, {
readConsistencyInterval: 0,
});
const table = await conn.createTable("my_table", [
{ id: 1n, vector: [0.1, 0.2] },
]);
expect(await table.version()).toBe(1);
await table.add([{ id: 2n, vector: [0.3, 0.4] }]);
expect(await table.version()).toBe(2);
const tagsManager = await table.tags();
const initialTags = await tagsManager.list();
expect(Object.keys(initialTags).length).toBe(0);
const tag1 = "tag1";
await tagsManager.create(tag1, 1);
expect(await tagsManager.getVersion(tag1)).toBe(1);
const tagsAfterFirst = await tagsManager.list();
expect(Object.keys(tagsAfterFirst).length).toBe(1);
expect(tagsAfterFirst).toHaveProperty(tag1);
expect(tagsAfterFirst[tag1].version).toBe(1);
await tagsManager.create("tag2", 2);
expect(await tagsManager.getVersion("tag2")).toBe(2);
const tagsAfterSecond = await tagsManager.list();
expect(Object.keys(tagsAfterSecond).length).toBe(2);
expect(tagsAfterSecond).toHaveProperty(tag1);
expect(tagsAfterSecond[tag1].version).toBe(1);
expect(tagsAfterSecond).toHaveProperty("tag2");
expect(tagsAfterSecond["tag2"].version).toBe(2);
await table.add([{ id: 3n, vector: [0.5, 0.6] }]);
await tagsManager.update(tag1, 3);
expect(await tagsManager.getVersion(tag1)).toBe(3);
await tagsManager.delete("tag2");
const tagsAfterDelete = await tagsManager.list();
expect(Object.keys(tagsAfterDelete).length).toBe(1);
expect(tagsAfterDelete).toHaveProperty(tag1);
expect(tagsAfterDelete[tag1].version).toBe(3);
await table.add([{ id: 4n, vector: [0.7, 0.8] }]);
expect(await table.version()).toBe(4);
await table.checkout(tag1);
expect(await table.version()).toBe(3);
await table.checkoutLatest();
expect(await table.version()).toBe(4);
});
});
describe("when optimizing a dataset", () => {
let tmpDir: tmp.DirResult;
let table: Table;

View File

@@ -23,6 +23,11 @@ export {
OptimizeStats,
CompactionStats,
RemovalStats,
TableStatistics,
FragmentStatistics,
FragmentSummaryStats,
Tags,
TagContents,
} from "./native.js";
export {

View File

@@ -20,6 +20,8 @@ import {
IndexConfig,
IndexStatistics,
OptimizeStats,
TableStatistics,
Tags,
Table as _NativeTable,
} from "./native";
import {
@@ -374,7 +376,7 @@ export abstract class Table {
*
* Calling this method will set the table into time-travel mode. If you
* wish to return to standard mode, call `checkoutLatest`.
* @param {number} version The version to checkout
* @param {number | string} version The version to checkout, could be version number or tag
* @example
* ```typescript
* import * as lancedb from "@lancedb/lancedb"
@@ -390,7 +392,8 @@ export abstract class Table {
* console.log(await table.version()); // 2
* ```
*/
abstract checkout(version: number): Promise<void>;
abstract checkout(version: number | string): Promise<void>;
/**
* Checkout the latest version of the table. _This is an in-place operation._
*
@@ -404,6 +407,23 @@ export abstract class Table {
*/
abstract listVersions(): Promise<Version[]>;
/**
* Get a tags manager for this table.
*
* Tags allow you to label specific versions of a table with a human-readable name.
* The returned tags manager can be used to list, create, update, or delete tags.
*
* @returns {Tags} A tags manager for this table
* @example
* ```typescript
* const tagsManager = await table.tags();
* await tagsManager.create("v1", 1);
* const tags = await tagsManager.list();
* console.log(tags); // { "v1": { version: 1, manifestSize: ... } }
* ```
*/
abstract tags(): Promise<Tags>;
/**
* Restore the table to the currently checked out version
*
@@ -463,6 +483,13 @@ export abstract class Table {
* Use {@link Table.listIndices} to find the names of the indices.
*/
abstract indexStats(name: string): Promise<IndexStatistics | undefined>;
/** Returns table and fragment statistics
*
* @returns {TableStatistics} The table and fragment statistics
*
*/
abstract stats(): Promise<TableStatistics>;
}
export class LocalTable extends Table {
@@ -699,8 +726,11 @@ export class LocalTable extends Table {
return await this.inner.version();
}
async checkout(version: number): Promise<void> {
await this.inner.checkout(version);
async checkout(version: number | string): Promise<void> {
if (typeof version === "string") {
return this.inner.checkoutTag(version);
}
return this.inner.checkout(version);
}
async checkoutLatest(): Promise<void> {
@@ -719,6 +749,10 @@ export class LocalTable extends Table {
await this.inner.restore();
}
async tags(): Promise<Tags> {
return await this.inner.tags();
}
async optimize(options?: Partial<OptimizeOptions>): Promise<OptimizeStats> {
let cleanupOlderThanMs;
if (
@@ -749,6 +783,11 @@ export class LocalTable extends Table {
}
return stats;
}
async stats(): Promise<TableStatistics> {
return await this.inner.stats();
}
mergeInsert(on: string | string[]): MergeInsertBuilder {
on = Array.isArray(on) ? on : [on];
return new MergeInsertBuilder(this.inner.mergeInsert(on), this.schema());

View File

@@ -1,6 +1,6 @@
{
"name": "@lancedb/lancedb-darwin-arm64",
"version": "0.19.0-beta.11",
"version": "0.19.1-beta.0",
"os": ["darwin"],
"cpu": ["arm64"],
"main": "lancedb.darwin-arm64.node",

View File

@@ -1,6 +1,6 @@
{
"name": "@lancedb/lancedb-darwin-x64",
"version": "0.19.0-beta.11",
"version": "0.19.1-beta.0",
"os": ["darwin"],
"cpu": ["x64"],
"main": "lancedb.darwin-x64.node",

View File

@@ -1,6 +1,6 @@
{
"name": "@lancedb/lancedb-linux-arm64-gnu",
"version": "0.19.0-beta.11",
"version": "0.19.1-beta.0",
"os": ["linux"],
"cpu": ["arm64"],
"main": "lancedb.linux-arm64-gnu.node",

View File

@@ -1,6 +1,6 @@
{
"name": "@lancedb/lancedb-linux-arm64-musl",
"version": "0.19.0-beta.11",
"version": "0.19.1-beta.0",
"os": ["linux"],
"cpu": ["arm64"],
"main": "lancedb.linux-arm64-musl.node",

View File

@@ -1,6 +1,6 @@
{
"name": "@lancedb/lancedb-linux-x64-gnu",
"version": "0.19.0-beta.11",
"version": "0.19.1-beta.0",
"os": ["linux"],
"cpu": ["x64"],
"main": "lancedb.linux-x64-gnu.node",

View File

@@ -1,6 +1,6 @@
{
"name": "@lancedb/lancedb-linux-x64-musl",
"version": "0.19.0-beta.11",
"version": "0.19.1-beta.0",
"os": ["linux"],
"cpu": ["x64"],
"main": "lancedb.linux-x64-musl.node",

View File

@@ -1,6 +1,6 @@
{
"name": "@lancedb/lancedb-win32-arm64-msvc",
"version": "0.19.0-beta.11",
"version": "0.19.1-beta.0",
"os": [
"win32"
],

View File

@@ -1,6 +1,6 @@
{
"name": "@lancedb/lancedb-win32-x64-msvc",
"version": "0.19.0-beta.11",
"version": "0.19.1-beta.0",
"os": ["win32"],
"cpu": ["x64"],
"main": "lancedb.win32-x64-msvc.node",

View File

@@ -1,12 +1,12 @@
{
"name": "@lancedb/lancedb",
"version": "0.19.0-beta.11",
"version": "0.19.1-beta.0",
"lockfileVersion": 3,
"requires": true,
"packages": {
"": {
"name": "@lancedb/lancedb",
"version": "0.19.0-beta.11",
"version": "0.19.1-beta.0",
"cpu": [
"x64",
"arm64"

View File

@@ -11,7 +11,7 @@
"ann"
],
"private": false,
"version": "0.19.0-beta.11",
"version": "0.19.1-beta.0",
"main": "dist/index.js",
"exports": {
".": "./dist/index.js",

View File

@@ -157,6 +157,12 @@ impl Table {
.default_error()
}
#[napi(catch_unwind)]
pub async fn stats(&self) -> Result<TableStatistics> {
let stats = self.inner_ref()?.stats().await.default_error()?;
Ok(stats.into())
}
#[napi(catch_unwind)]
pub async fn update(
&self,
@@ -249,6 +255,14 @@ impl Table {
.default_error()
}
#[napi(catch_unwind)]
pub async fn checkout_tag(&self, tag: String) -> napi::Result<()> {
self.inner_ref()?
.checkout_tag(tag.as_str())
.await
.default_error()
}
#[napi(catch_unwind)]
pub async fn checkout_latest(&self) -> napi::Result<()> {
self.inner_ref()?.checkout_latest().await.default_error()
@@ -281,6 +295,13 @@ impl Table {
self.inner_ref()?.restore().await.default_error()
}
#[napi(catch_unwind)]
pub async fn tags(&self) -> napi::Result<Tags> {
Ok(Tags {
inner: self.inner_ref()?.clone(),
})
}
#[napi(catch_unwind)]
pub async fn optimize(
&self,
@@ -540,9 +561,158 @@ impl From<lancedb::index::IndexStatistics> for IndexStatistics {
}
}
#[napi(object)]
pub struct TableStatistics {
/// The total number of bytes in the table
pub total_bytes: i64,
/// The number of rows in the table
pub num_rows: i64,
/// The number of indices in the table
pub num_indices: i64,
/// Statistics on table fragments
pub fragment_stats: FragmentStatistics,
}
#[napi(object)]
pub struct FragmentStatistics {
/// The number of fragments in the table
pub num_fragments: i64,
/// The number of uncompacted fragments in the table
pub num_small_fragments: i64,
/// Statistics on the number of rows in the table fragments
pub lengths: FragmentSummaryStats,
}
#[napi(object)]
pub struct FragmentSummaryStats {
/// The number of rows in the fragment with the fewest rows
pub min: i64,
/// The number of rows in the fragment with the most rows
pub max: i64,
/// The mean number of rows in the fragments
pub mean: i64,
/// The 25th percentile of number of rows in the fragments
pub p25: i64,
/// The 50th percentile of number of rows in the fragments
pub p50: i64,
/// The 75th percentile of number of rows in the fragments
pub p75: i64,
/// The 99th percentile of number of rows in the fragments
pub p99: i64,
}
impl From<lancedb::table::TableStatistics> for TableStatistics {
fn from(v: lancedb::table::TableStatistics) -> Self {
Self {
total_bytes: v.total_bytes as i64,
num_rows: v.num_rows as i64,
num_indices: v.num_indices as i64,
fragment_stats: FragmentStatistics {
num_fragments: v.fragment_stats.num_fragments as i64,
num_small_fragments: v.fragment_stats.num_small_fragments as i64,
lengths: FragmentSummaryStats {
min: v.fragment_stats.lengths.min as i64,
max: v.fragment_stats.lengths.max as i64,
mean: v.fragment_stats.lengths.mean as i64,
p25: v.fragment_stats.lengths.p25 as i64,
p50: v.fragment_stats.lengths.p50 as i64,
p75: v.fragment_stats.lengths.p75 as i64,
p99: v.fragment_stats.lengths.p99 as i64,
},
},
}
}
}
#[napi(object)]
pub struct Version {
pub version: i64,
pub timestamp: i64,
pub metadata: HashMap<String, String>,
}
#[napi]
pub struct TagContents {
pub version: i64,
pub manifest_size: i64,
}
#[napi]
pub struct Tags {
inner: LanceDbTable,
}
#[napi]
impl Tags {
#[napi]
pub async fn list(&self) -> napi::Result<HashMap<String, TagContents>> {
let rust_tags = self.inner.tags().await.default_error()?;
let tag_list = rust_tags.as_ref().list().await.default_error()?;
let tag_contents = tag_list
.into_iter()
.map(|(k, v)| {
(
k,
TagContents {
version: v.version as i64,
manifest_size: v.manifest_size as i64,
},
)
})
.collect();
Ok(tag_contents)
}
#[napi]
pub async fn get_version(&self, tag: String) -> napi::Result<i64> {
let rust_tags = self.inner.tags().await.default_error()?;
rust_tags
.as_ref()
.get_version(tag.as_str())
.await
.map(|v| v as i64)
.default_error()
}
#[napi]
pub async unsafe fn create(&mut self, tag: String, version: i64) -> napi::Result<()> {
let mut rust_tags = self.inner.tags().await.default_error()?;
rust_tags
.as_mut()
.create(tag.as_str(), version as u64)
.await
.default_error()
}
#[napi]
pub async unsafe fn delete(&mut self, tag: String) -> napi::Result<()> {
let mut rust_tags = self.inner.tags().await.default_error()?;
rust_tags
.as_mut()
.delete(tag.as_str())
.await
.default_error()
}
#[napi]
pub async unsafe fn update(&mut self, tag: String, version: i64) -> napi::Result<()> {
let mut rust_tags = self.inner.tags().await.default_error()?;
rust_tags
.as_mut()
.update(tag.as_str(), version as u64)
.await
.default_error()
}
}

View File

@@ -1,5 +1,5 @@
[tool.bumpversion]
current_version = "0.22.0"
current_version = "0.22.1-beta.1"
parse = """(?x)
(?P<major>0|[1-9]\\d*)\\.
(?P<minor>0|[1-9]\\d*)\\.

View File

@@ -1,6 +1,6 @@
[package]
name = "lancedb-python"
version = "0.22.0"
version = "0.22.1-beta.1"
edition.workspace = true
description = "Python bindings for LanceDB"
license.workspace = true

View File

@@ -1,5 +1,5 @@
from datetime import timedelta
from typing import Dict, List, Optional, Tuple, Any, Union, Literal
from typing import Dict, List, Optional, Tuple, Any, TypedDict, Union, Literal
import pyarrow as pa
@@ -47,7 +47,7 @@ class Table:
): ...
async def list_versions(self) -> List[Dict[str, Any]]: ...
async def version(self) -> int: ...
async def checkout(self, version: int): ...
async def checkout(self, version: Union[int, str]): ...
async def checkout_latest(self): ...
async def restore(self, version: Optional[int] = None): ...
async def list_indices(self) -> list[IndexConfig]: ...
@@ -61,9 +61,18 @@ class Table:
cleanup_since_ms: Optional[int] = None,
delete_unverified: Optional[bool] = None,
) -> OptimizeStats: ...
@property
def tags(self) -> Tags: ...
def query(self) -> Query: ...
def vector_search(self) -> VectorQuery: ...
class Tags:
async def list(self) -> Dict[str, Tag]: ...
async def get_version(self, tag: str) -> int: ...
async def create(self, tag: str, version: int): ...
async def delete(self, tag: str): ...
async def update(self, tag: str, version: int): ...
class IndexConfig:
index_type: str
columns: List[str]
@@ -195,3 +204,7 @@ class RemovalStats:
class OptimizeStats:
compaction: CompactionStats
prune: RemovalStats
class Tag(TypedDict):
version: int
manifest_size: int

View File

@@ -1636,51 +1636,7 @@ class LanceHybridQueryBuilder(LanceQueryBuilder):
raise NotImplementedError("to_query_object not yet supported on a hybrid query")
def to_arrow(self, *, timeout: Optional[timedelta] = None) -> pa.Table:
vector_query, fts_query = self._validate_query(
self._query, self._vector, self._text
)
self._fts_query = LanceFtsQueryBuilder(
self._table, fts_query, fts_columns=self._fts_columns
)
vector_query = self._query_to_vector(
self._table, vector_query, self._vector_column
)
self._vector_query = LanceVectorQueryBuilder(
self._table, vector_query, self._vector_column
)
if self._limit:
self._vector_query.limit(self._limit)
self._fts_query.limit(self._limit)
if self._columns:
self._vector_query.select(self._columns)
self._fts_query.select(self._columns)
if self._where:
self._vector_query.where(self._where, self._postfilter)
self._fts_query.where(self._where, self._postfilter)
if self._with_row_id:
self._vector_query.with_row_id(True)
self._fts_query.with_row_id(True)
if self._phrase_query:
self._fts_query.phrase_query(True)
if self._distance_type:
self._vector_query.metric(self._distance_type)
if self._nprobes:
self._vector_query.nprobes(self._nprobes)
if self._refine_factor:
self._vector_query.refine_factor(self._refine_factor)
if self._ef:
self._vector_query.ef(self._ef)
if self._bypass_vector_index:
self._vector_query.bypass_vector_index()
if self._lower_bound or self._upper_bound:
self._vector_query.distance_range(
lower_bound=self._lower_bound, upper_bound=self._upper_bound
)
if self._reranker is None:
self._reranker = RRFReranker()
self._create_query_builders()
with ThreadPoolExecutor() as executor:
fts_future = executor.submit(
self._fts_query.with_row_id(True).to_arrow, timeout=timeout
@@ -2003,6 +1959,112 @@ class LanceHybridQueryBuilder(LanceQueryBuilder):
self._bypass_vector_index = True
return self
def explain_plan(self, verbose: Optional[bool] = False) -> str:
"""Return the execution plan for this query.
Examples
--------
>>> import lancedb
>>> db = lancedb.connect("./.lancedb")
>>> table = db.create_table("my_table", [{"vector": [99.0, 99]}])
>>> query = [100, 100]
>>> plan = table.search(query).explain_plan(True)
>>> print(plan) # doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE
ProjectionExec: expr=[vector@0 as vector, _distance@2 as _distance]
GlobalLimitExec: skip=0, fetch=10
FilterExec: _distance@2 IS NOT NULL
SortExec: TopK(fetch=10), expr=[_distance@2 ASC NULLS LAST], preserve_partitioning=[false]
KNNVectorDistance: metric=l2
LanceScan: uri=..., projection=[vector], row_id=true, row_addr=false, ordered=false
Parameters
----------
verbose : bool, default False
Use a verbose output format.
Returns
-------
plan : str
""" # noqa: E501
self._create_query_builders()
results = ["Vector Search Plan:"]
results.append(
self._table._explain_plan(
self._vector_query.to_query_object(), verbose=verbose
)
)
results.append("FTS Search Plan:")
results.append(
self._table._explain_plan(
self._fts_query.to_query_object(), verbose=verbose
)
)
return "\n".join(results)
def analyze_plan(self):
"""Execute the query and display with runtime metrics.
Returns
-------
plan : str
"""
self._create_query_builders()
results = ["Vector Search Plan:"]
results.append(self._table._analyze_plan(self._vector_query.to_query_object()))
results.append("FTS Search Plan:")
results.append(self._table._analyze_plan(self._fts_query.to_query_object()))
return "\n".join(results)
def _create_query_builders(self):
"""Set up and configure the vector and FTS query builders."""
vector_query, fts_query = self._validate_query(
self._query, self._vector, self._text
)
self._fts_query = LanceFtsQueryBuilder(
self._table, fts_query, fts_columns=self._fts_columns
)
vector_query = self._query_to_vector(
self._table, vector_query, self._vector_column
)
self._vector_query = LanceVectorQueryBuilder(
self._table, vector_query, self._vector_column
)
# Apply common configurations
if self._limit:
self._vector_query.limit(self._limit)
self._fts_query.limit(self._limit)
if self._columns:
self._vector_query.select(self._columns)
self._fts_query.select(self._columns)
if self._where:
self._vector_query.where(self._where, self._postfilter)
self._fts_query.where(self._where, self._postfilter)
if self._with_row_id:
self._vector_query.with_row_id(True)
self._fts_query.with_row_id(True)
if self._phrase_query:
self._fts_query.phrase_query(True)
if self._distance_type:
self._vector_query.metric(self._distance_type)
if self._nprobes:
self._vector_query.nprobes(self._nprobes)
if self._refine_factor:
self._vector_query.refine_factor(self._refine_factor)
if self._ef:
self._vector_query.ef(self._ef)
if self._bypass_vector_index:
self._vector_query.bypass_vector_index()
if self._lower_bound or self._upper_bound:
self._vector_query.distance_range(
lower_bound=self._lower_bound, upper_bound=self._upper_bound
)
if self._reranker is None:
self._reranker = RRFReranker()
class AsyncQueryBase(object):
def __init__(self, inner: Union[LanceQuery, LanceVectorQuery]):

View File

@@ -18,7 +18,7 @@ from lancedb.merge import LanceMergeInsertBuilder
from lancedb.embeddings import EmbeddingFunctionRegistry
from ..query import LanceVectorQueryBuilder, LanceQueryBuilder
from ..table import AsyncTable, IndexStatistics, Query, Table
from ..table import AsyncTable, IndexStatistics, Query, Table, Tags
class RemoteTable(Table):
@@ -54,6 +54,10 @@ class RemoteTable(Table):
"""Get the current version of the table"""
return LOOP.run(self._table.version())
@property
def tags(self) -> Tags:
return Tags(self._table)
@cached_property
def embedding_functions(self) -> Dict[str, EmbeddingFunctionConfig]:
"""
@@ -81,7 +85,7 @@ class RemoteTable(Table):
"""to_pandas() is not yet supported on LanceDB cloud."""
return NotImplementedError("to_pandas() is not yet supported on LanceDB cloud.")
def checkout(self, version: int):
def checkout(self, version: Union[int, str]):
return LOOP.run(self._table.checkout(version))
def checkout_latest(self):
@@ -574,6 +578,9 @@ class RemoteTable(Table):
):
return LOOP.run(self._table.wait_for_index(index_names, timeout))
def stats(self):
return LOOP.run(self._table.stats())
def uses_v2_manifest_paths(self) -> bool:
raise NotImplementedError(
"uses_v2_manifest_paths() is not supported on the LanceDB Cloud"

View File

@@ -77,6 +77,7 @@ if TYPE_CHECKING:
OptimizeStats,
CleanupStats,
CompactionStats,
Tag,
)
from .db import LanceDBConnection
from .index import IndexConfig
@@ -582,6 +583,35 @@ class Table(ABC):
"""
raise NotImplementedError
@property
@abstractmethod
def tags(self) -> Tags:
"""Tag management for the table.
Similar to Git, tags are a way to add metadata to a specific version of the
table.
.. warning::
Tagged versions are exempted from the :py:meth:`cleanup_old_versions()`
process.
To remove a version that has been tagged, you must first
:py:meth:`~Tags.delete` the associated tag.
Examples
--------
.. code-block:: python
table = db.open_table("my_table")
table.tags.create("v2-prod-20250203", 10)
tags = table.tags.list()
"""
raise NotImplementedError
@property
@abstractmethod
def embedding_functions(self) -> Dict[str, EmbeddingFunctionConfig]:
@@ -709,6 +739,13 @@ class Table(ABC):
"""
raise NotImplementedError
@abstractmethod
def stats(self) -> TableStatistics:
"""
Retrieve table and fragment statistics.
"""
raise NotImplementedError
@abstractmethod
def create_scalar_index(
self,
@@ -1354,7 +1391,7 @@ class Table(ABC):
"""
@abstractmethod
def checkout(self, version: int):
def checkout(self, version: Union[int, str]):
"""
Checks out a specific version of the Table
@@ -1369,6 +1406,12 @@ class Table(ABC):
Any operation that modifies the table will fail while the table is in a checked
out state.
Parameters
----------
version: int | str,
The version to check out. A version number (`int`) or a tag
(`str`) can be provided.
To return the table to a normal state use `[Self::checkout_latest]`
"""
@@ -1538,7 +1581,45 @@ class LanceTable(Table):
"""Get the current version of the table"""
return LOOP.run(self._table.version())
def checkout(self, version: int):
@property
def tags(self) -> Tags:
"""Tag management for the table.
Similar to Git, tags are a way to add metadata to a specific version of the
table.
.. warning::
Tagged versions are exempted from the :py:meth:`cleanup_old_versions()`
process.
To remove a version that has been tagged, you must first
:py:meth:`~Tags.delete` the associated tag.
Returns
-------
Tags
The tag manager for managing tags for the table.
Examples
--------
>>> import lancedb
>>> db = lancedb.connect("./.lancedb")
>>> table = db.create_table("my_table",
... [{"vector": [1.1, 0.9], "type": "vector"}])
>>> table.tags.create("v1", table.version)
>>> table.add([{"vector": [0.5, 0.2], "type": "vector"}])
>>> tags = table.tags.list()
>>> print(tags["v1"]["version"])
1
>>> table.checkout("v1")
>>> table.to_pandas()
vector type
0 [1.1, 0.9] vector
"""
return Tags(self._table)
def checkout(self, version: Union[int, str]):
"""Checkout a version of the table. This is an in-place operation.
This allows viewing previous versions of the table. If you wish to
@@ -1550,8 +1631,9 @@ class LanceTable(Table):
Parameters
----------
version : int
The version to checkout.
version: int | str,
The version to check out. A version number (`int`) or a tag
(`str`) can be provided.
Examples
--------
@@ -1801,6 +1883,9 @@ class LanceTable(Table):
) -> None:
return LOOP.run(self._table.wait_for_index(index_names, timeout))
def stats(self) -> TableStatistics:
return LOOP.run(self._table.stats())
def create_scalar_index(
self,
column: str,
@@ -3095,6 +3180,12 @@ class AsyncTable:
"""
await self._inner.wait_for_index(index_names, timeout)
async def stats(self) -> TableStatistics:
"""
Retrieve table and fragment statistics.
"""
return await self._inner.stats()
async def add(
self,
data: DATA,
@@ -3746,7 +3837,7 @@ class AsyncTable:
return versions
async def checkout(self, version: int):
async def checkout(self, version: int | str):
"""
Checks out a specific version of the Table
@@ -3761,6 +3852,12 @@ class AsyncTable:
Any operation that modifies the table will fail while the table is in a checked
out state.
Parameters
----------
version: int | str,
The version to check out. A version number (`int`) or a tag
(`str`) can be provided.
To return the table to a normal state use `[Self::checkout_latest]`
"""
try:
@@ -3798,6 +3895,24 @@ class AsyncTable:
"""
await self._inner.restore(version)
@property
def tags(self) -> AsyncTags:
"""Tag management for the dataset.
Similar to Git, tags are a way to add metadata to a specific version of the
dataset.
.. warning::
Tagged versions are exempted from the
:py:meth:`optimize(cleanup_older_than)` process.
To remove a version that has been tagged, you must first
:py:meth:`~Tags.delete` the associated tag.
"""
return AsyncTags(self._inner)
async def optimize(
self,
*,
@@ -3967,3 +4082,217 @@ class IndexStatistics:
# a dictionary instead of a class.
def __getitem__(self, key):
return getattr(self, key)
@dataclass
class TableStatistics:
"""
Statistics about a table and fragments.
Attributes
----------
total_bytes: int
The total number of bytes in the table.
num_rows: int
The total number of rows in the table.
num_indices: int
The total number of indices in the table.
fragment_stats: FragmentStatistics
Statistics about fragments in the table.
"""
total_bytes: int
num_rows: int
num_indices: int
fragment_stats: FragmentStatistics
@dataclass
class FragmentStatistics:
"""
Statistics about fragments.
Attributes
----------
num_fragments: int
The total number of fragments in the table.
num_small_fragments: int
The total number of small fragments in the table.
Small fragments have low row counts and may need to be compacted.
lengths: FragmentSummaryStats
Statistics about the number of rows in the table fragments.
"""
num_fragments: int
num_small_fragments: int
lengths: FragmentSummaryStats
@dataclass
class FragmentSummaryStats:
"""
Statistics about fragments sizes
Attributes
----------
min: int
The number of rows in the fragment with the fewest rows.
max: int
The number of rows in the fragment with the most rows.
mean: int
The mean number of rows in the fragments.
p25: int
The 25th percentile of number of rows in the fragments.
p50: int
The 50th percentile of number of rows in the fragments.
p75: int
The 75th percentile of number of rows in the fragments.
p99: int
The 99th percentile of number of rows in the fragments.
"""
min: int
max: int
mean: int
p25: int
p50: int
p75: int
p99: int
class Tags:
"""
Table tag manager.
"""
def __init__(self, table):
self._table = table
def list(self) -> Dict[str, Tag]:
"""
List all table tags.
Returns
-------
dict[str, Tag]
A dictionary mapping tag names to version numbers.
"""
return LOOP.run(self._table.tags.list())
def get_version(self, tag: str) -> int:
"""
Get the version of a tag.
Parameters
----------
tag: str,
The name of the tag to get the version for.
"""
return LOOP.run(self._table.tags.get_version(tag))
def create(self, tag: str, version: int) -> None:
"""
Create a tag for a given table version.
Parameters
----------
tag: str,
The name of the tag to create. This name must be unique among all tag
names for the table.
version: int,
The table version to tag.
"""
LOOP.run(self._table.tags.create(tag, version))
def delete(self, tag: str) -> None:
"""
Delete tag from the table.
Parameters
----------
tag: str,
The name of the tag to delete.
"""
LOOP.run(self._table.tags.delete(tag))
def update(self, tag: str, version: int) -> None:
"""
Update tag to a new version.
Parameters
----------
tag: str,
The name of the tag to update.
version: int,
The new table version to tag.
"""
LOOP.run(self._table.tags.update(tag, version))
class AsyncTags:
"""
Async table tag manager.
"""
def __init__(self, table):
self._table = table
async def list(self) -> Dict[str, Tag]:
"""
List all table tags.
Returns
-------
dict[str, Tag]
A dictionary mapping tag names to version numbers.
"""
return await self._table.tags.list()
async def get_version(self, tag: str) -> int:
"""
Get the version of a tag.
Parameters
----------
tag: str,
The name of the tag to get the version for.
"""
return await self._table.tags.get_version(tag)
async def create(self, tag: str, version: int) -> None:
"""
Create a tag for a given table version.
Parameters
----------
tag: str,
The name of the tag to create. This name must be unique among all tag
names for the table.
version: int,
The table version to tag.
"""
await self._table.tags.create(tag, version)
async def delete(self, tag: str) -> None:
"""
Delete tag from the table.
Parameters
----------
tag: str,
The name of the tag to delete.
"""
await self._table.tags.delete(tag)
async def update(self, tag: str, version: int) -> None:
"""
Update tag to a new version.
Parameters
----------
tag: str,
The name of the tag to update.
version: int,
The new table version to tag.
"""
await self._table.tags.update(tag, version)

View File

@@ -389,6 +389,50 @@ def test_table_wait_for_index_timeout():
table.wait_for_index(["id_idx"], timedelta(seconds=1))
def test_stats():
stats = {
"total_bytes": 38,
"num_rows": 2,
"num_indices": 0,
"fragment_stats": {
"num_fragments": 1,
"num_small_fragments": 1,
"lengths": {
"min": 2,
"max": 2,
"mean": 2,
"p25": 2,
"p50": 2,
"p75": 2,
"p99": 2,
},
},
}
def handler(request):
if request.path == "/v1/table/test/create/?mode=create":
request.send_response(200)
request.send_header("Content-Type", "application/json")
request.end_headers()
request.wfile.write(b"{}")
elif request.path == "/v1/table/test/stats/":
request.send_response(200)
request.send_header("Content-Type", "application/json")
request.end_headers()
payload = json.dumps(stats)
request.wfile.write(payload.encode())
else:
print(request.path)
request.send_response(404)
request.end_headers()
with mock_lancedb_connection(handler) as db:
table = db.create_table("test", [{"id": 1}])
res = table.stats()
print(f"{res=}")
assert res == stats
@contextlib.contextmanager
def query_test_table(query_handler, *, server_version=Version("0.1.0")):
def handler(request):

View File

@@ -529,6 +529,113 @@ def test_versioning(mem_db: DBConnection):
assert len(table) == 2
def test_tags(mem_db: DBConnection):
table = mem_db.create_table(
"test",
data=[
{"vector": [3.1, 4.1], "item": "foo", "price": 10.0},
{"vector": [5.9, 26.5], "item": "bar", "price": 20.0},
],
)
table.tags.create("tag1", 1)
tags = table.tags.list()
assert "tag1" in tags
assert tags["tag1"]["version"] == 1
table.add(
data=[
{"vector": [10.0, 11.0], "item": "baz", "price": 30.0},
],
)
table.tags.create("tag2", 2)
tags = table.tags.list()
assert "tag1" in tags
assert "tag2" in tags
assert tags["tag1"]["version"] == 1
assert tags["tag2"]["version"] == 2
table.tags.delete("tag2")
table.tags.update("tag1", 2)
tags = table.tags.list()
assert "tag1" in tags
assert tags["tag1"]["version"] == 2
table.tags.update("tag1", 1)
tags = table.tags.list()
assert "tag1" in tags
assert tags["tag1"]["version"] == 1
table.checkout("tag1")
assert table.version == 1
assert table.count_rows() == 2
table.tags.create("tag2", 2)
table.checkout("tag2")
assert table.version == 2
assert table.count_rows() == 3
table.checkout_latest()
table.add(
data=[
{"vector": [12.0, 13.0], "item": "baz", "price": 40.0},
],
)
@pytest.mark.asyncio
async def test_async_tags(mem_db_async: AsyncConnection):
table = await mem_db_async.create_table(
"test",
data=[
{"vector": [3.1, 4.1], "item": "foo", "price": 10.0},
{"vector": [5.9, 26.5], "item": "bar", "price": 20.0},
],
)
await table.tags.create("tag1", 1)
tags = await table.tags.list()
assert "tag1" in tags
assert tags["tag1"]["version"] == 1
await table.add(
data=[
{"vector": [10.0, 11.0], "item": "baz", "price": 30.0},
],
)
await table.tags.create("tag2", 2)
tags = await table.tags.list()
assert "tag1" in tags
assert "tag2" in tags
assert tags["tag1"]["version"] == 1
assert tags["tag2"]["version"] == 2
await table.tags.delete("tag2")
await table.tags.update("tag1", 2)
tags = await table.tags.list()
assert "tag1" in tags
assert tags["tag1"]["version"] == 2
await table.tags.update("tag1", 1)
tags = await table.tags.list()
assert "tag1" in tags
assert tags["tag1"]["version"] == 1
await table.checkout("tag1")
assert await table.version() == 1
assert await table.count_rows() == 2
await table.tags.create("tag2", 2)
await table.checkout("tag2")
assert await table.version() == 2
assert await table.count_rows() == 3
await table.checkout_latest()
await table.add(
data=[
{"vector": [12.0, 13.0], "item": "baz", "price": 40.0},
],
)
@patch("lancedb.table.AsyncTable.create_index")
def test_create_index_method(mock_create_index, mem_db: DBConnection):
table = mem_db.create_table(
@@ -1588,3 +1695,31 @@ def test_replace_field_metadata(tmp_path):
schema = table.schema
field = schema[0].metadata
assert field == {b"foo": b"bar"}
def test_stats(mem_db: DBConnection):
table = mem_db.create_table(
"my_table",
data=[{"text": "foo", "id": 0}, {"text": "bar", "id": 1}],
)
assert len(table) == 2
stats = table.stats()
print(f"{stats=}")
assert stats == {
"total_bytes": 38,
"num_rows": 2,
"num_indices": 0,
"fragment_stats": {
"num_fragments": 1,
"num_small_fragments": 1,
"lengths": {
"min": 2,
"max": 2,
"mean": 2,
"p25": 2,
"p50": 2,
"p75": 2,
"p99": 2,
},
},
}

View File

@@ -2,6 +2,11 @@
// SPDX-FileCopyrightText: Copyright The LanceDB Authors
use std::{collections::HashMap, sync::Arc};
use crate::{
error::PythonErrorExt,
index::{extract_index_params, IndexConfig},
query::Query,
};
use arrow::{
datatypes::{DataType, Schema},
ffi_stream::ArrowArrayStreamReader,
@@ -12,19 +17,13 @@ use lancedb::table::{
Table as LanceDbTable,
};
use pyo3::{
exceptions::{PyKeyError, PyRuntimeError, PyValueError},
exceptions::{PyIOError, PyKeyError, PyRuntimeError, PyValueError},
pyclass, pymethods,
types::{IntoPyDict, PyAnyMethods, PyDict, PyDictMethods},
Bound, FromPyObject, PyAny, PyRef, PyResult, Python,
types::{IntoPyDict, PyAnyMethods, PyDict, PyDictMethods, PyInt, PyString},
Bound, FromPyObject, PyAny, PyObject, PyRef, PyResult, Python,
};
use pyo3_async_runtimes::tokio::future_into_py;
use crate::{
error::PythonErrorExt,
index::{extract_index_params, IndexConfig},
query::Query,
};
/// Statistics about a compaction operation.
#[pyclass(get_all)]
#[derive(Clone, Debug)]
@@ -280,6 +279,40 @@ impl Table {
})
}
pub fn stats(self_: PyRef<'_, Self>) -> PyResult<Bound<'_, PyAny>> {
let inner = self_.inner_ref()?.clone();
future_into_py(self_.py(), async move {
let stats = inner.stats().await.infer_error()?;
Python::with_gil(|py| {
let dict = PyDict::new(py);
dict.set_item("total_bytes", stats.total_bytes)?;
dict.set_item("num_rows", stats.num_rows)?;
dict.set_item("num_indices", stats.num_indices)?;
let fragment_stats = PyDict::new(py);
fragment_stats.set_item("num_fragments", stats.fragment_stats.num_fragments)?;
fragment_stats.set_item(
"num_small_fragments",
stats.fragment_stats.num_small_fragments,
)?;
let fragment_lengths = PyDict::new(py);
fragment_lengths.set_item("min", stats.fragment_stats.lengths.min)?;
fragment_lengths.set_item("max", stats.fragment_stats.lengths.max)?;
fragment_lengths.set_item("mean", stats.fragment_stats.lengths.mean)?;
fragment_lengths.set_item("p25", stats.fragment_stats.lengths.p25)?;
fragment_lengths.set_item("p50", stats.fragment_stats.lengths.p50)?;
fragment_lengths.set_item("p75", stats.fragment_stats.lengths.p75)?;
fragment_lengths.set_item("p99", stats.fragment_stats.lengths.p99)?;
fragment_stats.set_item("lengths", fragment_lengths)?;
dict.set_item("fragment_stats", fragment_stats)?;
Ok(Some(dict.unbind()))
})
})
}
pub fn __repr__(&self) -> String {
match &self.inner {
None => format!("ClosedTable({})", self.name),
@@ -322,10 +355,26 @@ impl Table {
})
}
pub fn checkout(self_: PyRef<'_, Self>, version: u64) -> PyResult<Bound<'_, PyAny>> {
pub fn checkout(self_: PyRef<'_, Self>, version: PyObject) -> PyResult<Bound<'_, PyAny>> {
let inner = self_.inner_ref()?.clone();
future_into_py(self_.py(), async move {
inner.checkout(version).await.infer_error()
let py = self_.py();
let (is_int, int_value, string_value) = if let Ok(i) = version.downcast_bound::<PyInt>(py) {
let num: u64 = i.extract()?;
(true, num, String::new())
} else if let Ok(s) = version.downcast_bound::<PyString>(py) {
let str_value = s.to_string();
(false, 0, str_value)
} else {
return Err(PyIOError::new_err(
"version must be an integer or a string.",
));
};
future_into_py(py, async move {
if is_int {
inner.checkout(int_value).await.infer_error()
} else {
inner.checkout_tag(&string_value).await.infer_error()
}
})
}
@@ -352,6 +401,11 @@ impl Table {
Query::new(self.inner_ref().unwrap().query())
}
#[getter]
pub fn tags(&self) -> PyResult<Tags> {
Ok(Tags::new(self.inner_ref()?.clone()))
}
/// Optimize the on-disk data by compacting and pruning old data, for better performance.
#[pyo3(signature = (cleanup_since_ms=None, delete_unverified=None, retrain=None))]
pub fn optimize(
@@ -586,3 +640,72 @@ pub struct MergeInsertParams {
when_not_matched_by_source_delete: bool,
when_not_matched_by_source_condition: Option<String>,
}
#[pyclass]
pub struct Tags {
inner: LanceDbTable,
}
impl Tags {
pub fn new(table: LanceDbTable) -> Self {
Self { inner: table }
}
}
#[pymethods]
impl Tags {
pub fn list(self_: PyRef<'_, Self>) -> PyResult<Bound<'_, PyAny>> {
let inner = self_.inner.clone();
future_into_py(self_.py(), async move {
let tags = inner.tags().await.infer_error()?;
let res = tags.list().await.infer_error()?;
Python::with_gil(|py| {
let py_dict = PyDict::new(py);
for (key, contents) in res {
let value_dict = PyDict::new(py);
value_dict.set_item("version", contents.version)?;
value_dict.set_item("manifest_size", contents.manifest_size)?;
py_dict.set_item(key, value_dict)?;
}
Ok(py_dict.unbind())
})
})
}
pub fn get_version(self_: PyRef<'_, Self>, tag: String) -> PyResult<Bound<'_, PyAny>> {
let inner = self_.inner.clone();
future_into_py(self_.py(), async move {
let tags = inner.tags().await.infer_error()?;
let res = tags.get_version(tag.as_str()).await.infer_error()?;
Ok(res)
})
}
pub fn create(self_: PyRef<Self>, tag: String, version: u64) -> PyResult<Bound<PyAny>> {
let inner = self_.inner.clone();
future_into_py(self_.py(), async move {
let mut tags = inner.tags().await.infer_error()?;
tags.create(tag.as_str(), version).await.infer_error()?;
Ok(())
})
}
pub fn delete(self_: PyRef<Self>, tag: String) -> PyResult<Bound<PyAny>> {
let inner = self_.inner.clone();
future_into_py(self_.py(), async move {
let mut tags = inner.tags().await.infer_error()?;
tags.delete(tag.as_str()).await.infer_error()?;
Ok(())
})
}
pub fn update(self_: PyRef<Self>, tag: String, version: u64) -> PyResult<Bound<PyAny>> {
let inner = self_.inner.clone();
future_into_py(self_.py(), async move {
let mut tags = inner.tags().await.infer_error()?;
tags.update(tag.as_str(), version).await.infer_error()?;
Ok(())
})
}
}

View File

@@ -1,6 +1,6 @@
[package]
name = "lancedb-node"
version = "0.19.0-beta.11"
version = "0.19.1-beta.0"
description = "Serverless, low-latency vector database for AI applications"
license.workspace = true
edition.workspace = true

View File

@@ -1,6 +1,6 @@
[package]
name = "lancedb"
version = "0.19.0-beta.11"
version = "0.19.1-beta.0"
edition.workspace = true
description = "LanceDB: A serverless, low-latency vector database for AI applications"
license.workspace = true

View File

@@ -4,7 +4,8 @@
use crate::index::Index;
use crate::index::IndexStatistics;
use crate::query::{QueryFilter, QueryRequest, Select, VectorQueryRequest};
use crate::table::{AddDataMode, AnyQuery, Filter};
use crate::table::Tags;
use crate::table::{AddDataMode, AnyQuery, Filter, TableStatistics};
use crate::utils::{supported_btree_data_type, supported_vector_data_type};
use crate::{DistanceType, Error, Table};
use arrow_array::{RecordBatch, RecordBatchIterator, RecordBatchReader};
@@ -18,11 +19,13 @@ use futures::TryStreamExt;
use http::header::CONTENT_TYPE;
use http::{HeaderName, StatusCode};
use lance::arrow::json::{JsonDataType, JsonSchema};
use lance::dataset::refs::TagContents;
use lance::dataset::scanner::DatasetRecordBatchStream;
use lance::dataset::{ColumnAlteration, NewColumnTransform, Version};
use lance_datafusion::exec::{execute_plan, OneShotExec};
use reqwest::{RequestBuilder, Response};
use serde::{Deserialize, Serialize};
use std::collections::HashMap;
use std::io::Cursor;
use std::pin::Pin;
use std::sync::{Arc, Mutex};
@@ -47,6 +50,137 @@ use crate::{
const REQUEST_TIMEOUT_HEADER: HeaderName = HeaderName::from_static("x-request-timeout-ms");
pub struct RemoteTags<'a, S: HttpSend = Sender> {
inner: &'a RemoteTable<S>,
}
#[async_trait]
impl<S: HttpSend + 'static> Tags for RemoteTags<'_, S> {
async fn list(&self) -> Result<HashMap<String, TagContents>> {
let request = self
.inner
.client
.post(&format!("/v1/table/{}/tags/list/", self.inner.name));
let (request_id, response) = self.inner.send(request, true).await?;
let response = self
.inner
.check_table_response(&request_id, response)
.await?;
match response.text().await {
Ok(body) => {
// Explicitly tell serde_json what type we want to deserialize into
let tags_map: HashMap<String, TagContents> =
serde_json::from_str(&body).map_err(|e| Error::Http {
source: format!("Failed to parse tags list: {}", e).into(),
request_id,
status_code: None,
})?;
Ok(tags_map)
}
Err(err) => {
let status_code = err.status();
Err(Error::Http {
source: Box::new(err),
request_id,
status_code,
})
}
}
}
async fn get_version(&self, tag: &str) -> Result<u64> {
let request = self
.inner
.client
.post(&format!("/v1/table/{}/tags/version/", self.inner.name))
.json(&serde_json::json!({ "tag": tag }));
let (request_id, response) = self.inner.send(request, true).await?;
let response = self
.inner
.check_table_response(&request_id, response)
.await?;
match response.text().await {
Ok(body) => {
let value: serde_json::Value =
serde_json::from_str(&body).map_err(|e| Error::Http {
source: format!("Failed to parse tag version: {}", e).into(),
request_id: request_id.clone(),
status_code: None,
})?;
value
.get("version")
.and_then(|v| v.as_u64())
.ok_or_else(|| Error::Http {
source: format!("Invalid tag version response: {}", body).into(),
request_id,
status_code: None,
})
}
Err(err) => {
let status_code = err.status();
Err(Error::Http {
source: Box::new(err),
request_id,
status_code,
})
}
}
}
async fn create(&mut self, tag: &str, version: u64) -> Result<()> {
let request = self
.inner
.client
.post(&format!("/v1/table/{}/tags/create/", self.inner.name))
.json(&serde_json::json!({
"tag": tag,
"version": version
}));
let (request_id, response) = self.inner.send(request, true).await?;
self.inner
.check_table_response(&request_id, response)
.await?;
Ok(())
}
async fn delete(&mut self, tag: &str) -> Result<()> {
let request = self
.inner
.client
.post(&format!("/v1/table/{}/tags/delete/", self.inner.name))
.json(&serde_json::json!({ "tag": tag }));
let (request_id, response) = self.inner.send(request, true).await?;
self.inner
.check_table_response(&request_id, response)
.await?;
Ok(())
}
async fn update(&mut self, tag: &str, version: u64) -> Result<()> {
let request = self
.inner
.client
.post(&format!("/v1/table/{}/tags/update/", self.inner.name))
.json(&serde_json::json!({
"tag": tag,
"version": version
}));
let (request_id, response) = self.inner.send(request, true).await?;
self.inner
.check_table_response(&request_id, response)
.await?;
Ok(())
}
}
#[derive(Debug)]
pub struct RemoteTable<S: HttpSend = Sender> {
#[allow(dead_code)]
@@ -905,6 +1039,16 @@ impl<S: HttpSend> BaseTable for RemoteTable<S> {
Ok(())
}
async fn tags(&self) -> Result<Box<dyn Tags + '_>> {
Ok(Box::new(RemoteTags { inner: self }))
}
async fn checkout_tag(&self, tag: &str) -> Result<()> {
let tags = self.tags().await?;
let version = tags.get_version(tag).await?;
let mut write_guard = self.version.write().await;
*write_guard = Some(version);
Ok(())
}
async fn optimize(&self, _action: OptimizeAction) -> Result<OptimizeStats> {
self.check_mutable().await?;
Err(Error::NotSupported {
@@ -1098,6 +1242,20 @@ impl<S: HttpSend> BaseTable for RemoteTable<S> {
fn dataset_uri(&self) -> &str {
"NOT_SUPPORTED"
}
async fn stats(&self) -> Result<TableStatistics> {
let request = self.client.post(&format!("/v1/table/{}/stats/", self.name));
let (request_id, response) = self.send(request, true).await?;
let response = self.check_table_response(&request_id, response).await?;
let body = response.text().await.err_to_http(request_id.clone())?;
let stats = serde_json::from_str(&body).map_err(|e| Error::Http {
source: format!("Failed to parse table statistics: {}", e).into(),
request_id,
status_code: None,
})?;
Ok(stats)
}
}
#[derive(Serialize)]

View File

@@ -80,9 +80,13 @@ pub mod merge;
use crate::index::waiter::wait_for_index;
pub use chrono::Duration;
use futures::future::join_all;
pub use lance::dataset::optimize::CompactionOptions;
pub use lance::dataset::refs::{TagContents, Tags as LanceTags};
pub use lance::dataset::scanner::DatasetRecordBatchStream;
use lance::dataset::statistics::DatasetStatisticsExt;
pub use lance_index::optimize::OptimizeOptions;
use serde_with::skip_serializing_none;
/// Defines the type of column
#[derive(Debug, Clone, Serialize, Deserialize)]
@@ -401,6 +405,24 @@ pub enum AnyQuery {
VectorQuery(VectorQueryRequest),
}
#[async_trait]
pub trait Tags: Send + Sync {
/// List the tags of the table.
async fn list(&self) -> Result<HashMap<String, TagContents>>;
/// Get the version of the table referenced by a tag.
async fn get_version(&self, tag: &str) -> Result<u64>;
/// Create a new tag for the given version of the table.
async fn create(&mut self, tag: &str, version: u64) -> Result<()>;
/// Delete a tag from the table.
async fn delete(&mut self, tag: &str) -> Result<()>;
/// Update an existing tag to point to a new version of the table.
async fn update(&mut self, tag: &str, version: u64) -> Result<()>;
}
/// A trait for anything "table-like". This is used for both native tables (which target
/// Lance datasets) and remote tables (which target LanceDB cloud)
///
@@ -466,6 +488,8 @@ pub trait BaseTable: std::fmt::Display + std::fmt::Debug + Send + Sync {
params: MergeInsertBuilder,
new_data: Box<dyn RecordBatchReader + Send>,
) -> Result<()>;
/// Gets the table tag manager.
async fn tags(&self) -> Result<Box<dyn Tags + '_>>;
/// Optimize the dataset.
async fn optimize(&self, action: OptimizeAction) -> Result<OptimizeStats>;
/// Add columns to the table.
@@ -482,6 +506,9 @@ pub trait BaseTable: std::fmt::Display + std::fmt::Debug + Send + Sync {
async fn version(&self) -> Result<u64>;
/// Checkout a specific version of the table.
async fn checkout(&self, version: u64) -> Result<()>;
/// Checkout a table version referenced by a tag.
/// Tags provide a human-readable way to reference specific versions of the table.
async fn checkout_tag(&self, tag: &str) -> Result<()>;
/// Checkout the latest version of the table.
async fn checkout_latest(&self) -> Result<()>;
/// Restore the table to the currently checked out version.
@@ -499,6 +526,8 @@ pub trait BaseTable: std::fmt::Display + std::fmt::Debug + Send + Sync {
index_names: &[&str],
timeout: std::time::Duration,
) -> Result<()>;
/// Get statistics on the table
async fn stats(&self) -> Result<TableStatistics>;
}
/// A Table is a collection of strong typed Rows.
@@ -1058,6 +1087,24 @@ impl Table {
self.inner.checkout(version).await
}
/// Checks out a specific version of the Table by tag
///
/// Any read operation on the table will now access the data at the version referenced by the tag.
/// As a consequence, calling this method will disable any read consistency interval
/// that was previously set.
///
/// This is a read-only operation that turns the table into a sort of "view"
/// or "detached head". Other table instances will not be affected. To make the change
/// permanent you can use the `[Self::restore]` method.
///
/// Any operation that modifies the table will fail while the table is in a checked
/// out state.
///
/// To return the table to a normal state use `[Self::checkout_latest]`
pub async fn checkout_tag(&self, tag: &str) -> Result<()> {
self.inner.checkout_tag(tag).await
}
/// Ensures the table is pointing at the latest version
///
/// This can be used to manually update a table when the read_consistency_interval is None
@@ -1144,6 +1191,11 @@ impl Table {
self.inner.wait_for_index(index_names, timeout).await
}
/// Get the tags manager.
pub async fn tags(&self) -> Result<Box<dyn Tags + '_>> {
self.inner.tags().await
}
// Take many execution plans and map them into a single plan that adds
// a query_index column and unions them.
pub(crate) fn multi_vector_plan(
@@ -1194,6 +1246,40 @@ impl Table {
.unwrap();
Ok(Arc::new(repartitioned))
}
/// Retrieve statistics on the table
pub async fn stats(&self) -> Result<TableStatistics> {
self.inner.stats().await
}
}
pub struct NativeTags {
inner: LanceTags,
}
#[async_trait]
impl Tags for NativeTags {
async fn list(&self) -> Result<HashMap<String, TagContents>> {
Ok(self.inner.list().await?)
}
async fn get_version(&self, tag: &str) -> Result<u64> {
Ok(self.inner.get_version(tag).await?)
}
async fn create(&mut self, tag: &str, version: u64) -> Result<()> {
self.inner.create(tag, version).await?;
Ok(())
}
async fn delete(&mut self, tag: &str) -> Result<()> {
self.inner.delete(tag).await?;
Ok(())
}
async fn update(&mut self, tag: &str, version: u64) -> Result<()> {
self.inner.update(tag, version).await?;
Ok(())
}
}
impl From<NativeTable> for Table {
@@ -1940,6 +2026,10 @@ impl BaseTable for NativeTable {
self.dataset.as_time_travel(version).await
}
async fn checkout_tag(&self, tag: &str) -> Result<()> {
self.dataset.as_time_travel(tag).await
}
async fn checkout_latest(&self) -> Result<()> {
self.dataset
.as_latest(self.read_consistency_interval)
@@ -2315,6 +2405,14 @@ impl BaseTable for NativeTable {
Ok(())
}
async fn tags(&self) -> Result<Box<dyn Tags + '_>> {
let dataset = self.dataset.get().await?;
Ok(Box::new(NativeTags {
inner: dataset.tags.clone(),
}))
}
async fn optimize(&self, action: OptimizeAction) -> Result<OptimizeStats> {
let mut stats = OptimizeStats {
compaction: None,
@@ -2480,6 +2578,108 @@ impl BaseTable for NativeTable {
) -> Result<()> {
wait_for_index(self, index_names, timeout).await
}
async fn stats(&self) -> Result<TableStatistics> {
let num_rows = self.count_rows(None).await?;
let num_indices = self.list_indices().await?.len();
let ds = self.dataset.get().await?;
let ds_clone = (*ds).clone();
let ds_stats = Arc::new(ds_clone).calculate_data_stats().await?;
let total_bytes = ds_stats.fields.iter().map(|f| f.bytes_on_disk).sum::<u64>() as usize;
let frags = ds.get_fragments();
let mut sorted_sizes = join_all(
frags
.iter()
.map(|frag| async move { frag.physical_rows().await.unwrap_or(0) }),
)
.await;
sorted_sizes.sort();
let small_frag_threshold = 100000;
let num_fragments = sorted_sizes.len();
let num_small_fragments = sorted_sizes
.iter()
.filter(|&&size| size < small_frag_threshold)
.count();
let p25 = *sorted_sizes.get(num_fragments / 4).unwrap_or(&0);
let p50 = *sorted_sizes.get(num_fragments / 2).unwrap_or(&0);
let p75 = *sorted_sizes.get(num_fragments * 3 / 4).unwrap_or(&0);
let p99 = *sorted_sizes.get(num_fragments * 99 / 100).unwrap_or(&0);
let min = sorted_sizes.first().copied().unwrap_or(0);
let max = sorted_sizes.last().copied().unwrap_or(0);
let mean = if num_fragments == 0 {
0
} else {
sorted_sizes.iter().copied().sum::<usize>() / num_fragments
};
let frag_stats = FragmentStatistics {
num_fragments,
num_small_fragments,
lengths: FragmentSummaryStats {
min,
max,
mean,
p25,
p50,
p75,
p99,
},
};
let stats = TableStatistics {
total_bytes,
num_rows,
num_indices,
fragment_stats: frag_stats,
};
Ok(stats)
}
}
#[skip_serializing_none]
#[derive(Debug, Deserialize, PartialEq)]
pub struct TableStatistics {
/// The total number of bytes in the table
pub total_bytes: usize,
/// The number of rows in the table
pub num_rows: usize,
/// The number of indices in the table
pub num_indices: usize,
/// Statistics on table fragments
pub fragment_stats: FragmentStatistics,
}
#[skip_serializing_none]
#[derive(Debug, Deserialize, PartialEq)]
pub struct FragmentStatistics {
/// The number of fragments in the table
pub num_fragments: usize,
/// The number of uncompacted fragments in the table
pub num_small_fragments: usize,
/// Statistics on the number of rows in the table fragments
pub lengths: FragmentSummaryStats,
// todo: add size statistics
// /// Statistics on the number of bytes in the table fragments
// sizes: FragmentStats,
}
#[skip_serializing_none]
#[derive(Debug, Deserialize, PartialEq)]
pub struct FragmentSummaryStats {
pub min: usize,
pub max: usize,
pub mean: usize,
pub p25: usize,
pub p50: usize,
pub p75: usize,
pub p99: usize,
}
#[cfg(test)]
@@ -3081,6 +3281,60 @@ mod tests {
)
}
#[tokio::test]
async fn test_tags() {
let tmp_dir = tempdir().unwrap();
let uri = tmp_dir.path().to_str().unwrap();
let conn = ConnectBuilder::new(uri)
.read_consistency_interval(Duration::from_secs(0))
.execute()
.await
.unwrap();
let table = conn
.create_table("my_table", some_sample_data())
.execute()
.await
.unwrap();
assert_eq!(table.version().await.unwrap(), 1);
table.add(some_sample_data()).execute().await.unwrap();
assert_eq!(table.version().await.unwrap(), 2);
let mut tags_manager = table.tags().await.unwrap();
let tags = tags_manager.list().await.unwrap();
assert!(tags.is_empty(), "Tags should be empty initially");
let tag1 = "tag1";
tags_manager.create(tag1, 1).await.unwrap();
assert_eq!(tags_manager.get_version(tag1).await.unwrap(), 1);
let tags = tags_manager.list().await.unwrap();
assert_eq!(tags.len(), 1);
assert!(tags.contains_key(tag1));
assert_eq!(tags.get(tag1).unwrap().version, 1);
tags_manager.create("tag2", 2).await.unwrap();
assert_eq!(tags_manager.get_version("tag2").await.unwrap(), 2);
let tags = tags_manager.list().await.unwrap();
assert_eq!(tags.len(), 2);
assert!(tags.contains_key(tag1));
assert_eq!(tags.get(tag1).unwrap().version, 1);
assert!(tags.contains_key("tag2"));
assert_eq!(tags.get("tag2").unwrap().version, 2);
// Test update and delete
table.add(some_sample_data()).execute().await.unwrap();
tags_manager.update(tag1, 3).await.unwrap();
assert_eq!(tags_manager.get_version(tag1).await.unwrap(), 3);
tags_manager.delete("tag2").await.unwrap();
let tags = tags_manager.list().await.unwrap();
assert_eq!(tags.len(), 1);
assert!(tags.contains_key(tag1));
assert_eq!(tags.get(tag1).unwrap().version, 3);
// Test checkout tag
table.add(some_sample_data()).execute().await.unwrap();
assert_eq!(table.version().await.unwrap(), 4);
table.checkout_tag(tag1).await.unwrap();
assert_eq!(table.version().await.unwrap(), 3);
table.checkout_latest().await.unwrap();
assert_eq!(table.version().await.unwrap(), 4);
}
#[tokio::test]
async fn test_create_index() {
use arrow_array::RecordBatch;
@@ -3803,4 +4057,108 @@ mod tests {
Some(&"test_field_val1".to_string())
);
}
#[tokio::test]
pub async fn test_stats() {
let tmp_dir = tempdir().unwrap();
let uri = tmp_dir.path().to_str().unwrap();
let conn = ConnectBuilder::new(uri).execute().await.unwrap();
let schema = Arc::new(Schema::new(vec![
Field::new("id", DataType::Int32, false),
Field::new("foo", DataType::Int32, true),
]));
let batch = RecordBatch::try_new(
schema.clone(),
vec![
Arc::new(Int32Array::from_iter_values(0..100)),
Arc::new(Int32Array::from_iter_values(0..100)),
],
)
.unwrap();
let table = conn
.create_table(
"test_stats",
RecordBatchIterator::new(vec![Ok(batch.clone())], batch.schema()),
)
.execute()
.await
.unwrap();
for _ in 0..10 {
let batch = RecordBatch::try_new(
schema.clone(),
vec![
Arc::new(Int32Array::from_iter_values(0..15)),
Arc::new(Int32Array::from_iter_values(0..15)),
],
)
.unwrap();
table
.add(RecordBatchIterator::new(
vec![Ok(batch.clone())],
batch.schema(),
))
.execute()
.await
.unwrap();
}
let empty_table = conn
.create_table(
"test_stats_empty",
RecordBatchIterator::new(vec![], batch.schema()),
)
.execute()
.await
.unwrap();
let res = table.stats().await.unwrap();
println!("{:#?}", res);
assert_eq!(
res,
TableStatistics {
num_rows: 250,
num_indices: 0,
total_bytes: 2000,
fragment_stats: FragmentStatistics {
num_fragments: 11,
num_small_fragments: 11,
lengths: FragmentSummaryStats {
min: 15,
max: 100,
mean: 22,
p25: 15,
p50: 15,
p75: 15,
p99: 100,
},
},
}
);
let res = empty_table.stats().await.unwrap();
println!("{:#?}", res);
assert_eq!(
res,
TableStatistics {
num_rows: 0,
num_indices: 0,
total_bytes: 0,
fragment_stats: FragmentStatistics {
num_fragments: 0,
num_small_fragments: 0,
lengths: FragmentSummaryStats {
min: 0,
max: 0,
mean: 0,
p25: 0,
p50: 0,
p75: 0,
p99: 0,
},
},
}
)
}
}

View File

@@ -7,7 +7,7 @@ use std::{
time::{self, Duration, Instant},
};
use lance::Dataset;
use lance::{dataset::refs, Dataset};
use tokio::sync::{RwLock, RwLockReadGuard, RwLockWriteGuard};
use crate::error::Result;
@@ -83,19 +83,32 @@ impl DatasetRef {
}
}
async fn as_time_travel(&mut self, target_version: u64) -> Result<()> {
async fn as_time_travel(&mut self, target_version: impl Into<refs::Ref>) -> Result<()> {
let target_ref = target_version.into();
match self {
Self::Latest { dataset, .. } => {
let new_dataset = dataset.checkout_version(target_ref.clone()).await?;
let version_value = new_dataset.version().version;
*self = Self::TimeTravel {
dataset: dataset.checkout_version(target_version).await?,
version: target_version,
dataset: new_dataset,
version: version_value,
};
}
Self::TimeTravel { dataset, version } => {
if *version != target_version {
let should_checkout = match &target_ref {
refs::Ref::Version(target_ver) => version != target_ver,
refs::Ref::Tag(_) => true, // Always checkout for tags
};
if should_checkout {
let new_dataset = dataset.checkout_version(target_ref).await?;
let version_value = new_dataset.version().version;
*self = Self::TimeTravel {
dataset: dataset.checkout_version(target_version).await?,
version: target_version,
dataset: new_dataset,
version: version_value,
};
}
}
@@ -175,7 +188,7 @@ impl DatasetConsistencyWrapper {
write_guard.as_latest(read_consistency_interval).await
}
pub async fn as_time_travel(&self, target_version: u64) -> Result<()> {
pub async fn as_time_travel(&self, target_version: impl Into<refs::Ref>) -> Result<()> {
self.0.write().await.as_time_travel(target_version).await
}