feat: add take_offsets and take_row_ids (#2584)

These operations have existed in Lance for a long while, and many users need to drop down to Lance for this capability. This PR adds the API and implements it using filters (e.g. `_rowid IN (...)`) so that it doesn't currently add any load to `BaseTable`. I'm not sure that is sustainable, as base table implementations may want to specialize how they handle this method. However, I figure it is a good starting point. In addition, unlike Lance, this API does not currently guarantee anything about the order of the take results. This is necessary for the fallback filter approach to work (SQL filters cannot guarantee result order).
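
To make the filter fallback and the ordering caveat concrete, here is a rough sketch in TypeScript. It is not the code in this PR: `rowIdFilter` and `reorderByRowId` are hypothetical helper names, and the sketch assumes results carry a numeric `_rowid` field and that the SQL dialect accepts a bare `false` predicate. It only shows how a list of row ids can become an `IN` predicate, and how a caller who needs a specific order could restore it client-side.

```typescript
// Hypothetical helpers for illustration only; they are not part of the LanceDB API.

/** Build a SQL predicate that selects the given row ids, e.g. `_rowid IN (3, 1, 2)`. */
function rowIdFilter(rowIds: readonly number[]): string {
  if (rowIds.length === 0) {
    // `IN ()` is not valid SQL, so match nothing instead.
    // Assumption: the SQL dialect accepts a bare boolean literal here.
    return "false";
  }
  for (const id of rowIds) {
    // Reject values that cannot be represented exactly as a JS number.
    if (!Number.isSafeInteger(id) || id < 0) {
      throw new RangeError(`invalid row id: ${id}`);
    }
  }
  return `_rowid IN (${rowIds.join(", ")})`;
}

/**
 * Reorder rows to match the requested id order.
 * A filter gives no ordering guarantee, so the caller restores it here.
 * Requested ids with no matching row are skipped.
 */
function reorderByRowId<T extends { _rowid: number }>(
  rows: readonly T[],
  rowIds: readonly number[],
): T[] {
  const byId = new Map<number, T>();
  for (const row of rows) {
    byId.set(row._rowid, row);
  }
  const ordered: T[] = [];
  for (const id of rowIds) {
    const row = byId.get(id);
    if (row !== undefined) {
      ordered.push(row);
    }
  }
  return ordered;
}
```

A caller could presumably pass the predicate to `where(...)` on a regular query, request the id column with `withRowId()`, and then run the collected rows through `reorderByRowId` if the original order matters.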
2025-12-22 21:09:58 +00:00 · 2025-08-15 06:48:24 -07:00
parent 296205ef96
commit ed640a76d9
24 changed files with 1488 additions and 381 deletions
--- a/docs/src/js/classes/Query.md
+++ b/docs/src/js/classes/Query.md
@@ -14,7 +14,7 @@ A builder for LanceDB queries.

 ## Extends

- [`QueryBase`](QueryBase.md)&lt;`NativeQuery`&gt;
+- `StandardQueryBase`&lt;`NativeQuery`&gt;

 ## Properties

@@ -26,7 +26,7 @@ protected inner: Query | Promise<Query>;

 #### Inherited from

-[`QueryBase`](QueryBase.md).[`inner`](QueryBase.md#inner)
+`StandardQueryBase.inner`

 ## Methods

@@ -73,7 +73,7 @@ AnalyzeExec verbose=true, metrics=[]

 #### Inherited from

-[`QueryBase`](QueryBase.md).[`analyzePlan`](QueryBase.md#analyzeplan)
+`StandardQueryBase.analyzePlan`

 ***

@@ -107,7 +107,7 @@ single query)

 #### Inherited from

-[`QueryBase`](QueryBase.md).[`execute`](QueryBase.md#execute)
+`StandardQueryBase.execute`

 ***

@@ -143,7 +143,7 @@ const plan = await table.query().nearestTo([0.5, 0.2]).explainPlan();

 #### Inherited from

-[`QueryBase`](QueryBase.md).[`explainPlan`](QueryBase.md#explainplan)
+`StandardQueryBase.explainPlan`

 ***

@@ -164,7 +164,7 @@ Use [Table#optimize](Table.md#optimize) to index all un-indexed data.

 #### Inherited from

-[`QueryBase`](QueryBase.md).[`fastSearch`](QueryBase.md#fastsearch)
+`StandardQueryBase.fastSearch`

 ***

@@ -194,7 +194,7 @@ Use `where` instead

 #### Inherited from

-[`QueryBase`](QueryBase.md).[`filter`](QueryBase.md#filter)
+`StandardQueryBase.filter`

 ***

@@ -216,7 +216,7 @@ fullTextSearch(query, options?): this

 #### Inherited from

-[`QueryBase`](QueryBase.md).[`fullTextSearch`](QueryBase.md#fulltextsearch)
+`StandardQueryBase.fullTextSearch`

 ***

@@ -241,7 +241,7 @@ called then every valid row from the table will be returned.

 #### Inherited from

-[`QueryBase`](QueryBase.md).[`limit`](QueryBase.md#limit)
+`StandardQueryBase.limit`

 ***

@@ -325,6 +325,10 @@ nearestToText(query, columns?): Query
 offset(offset): this
 ```

+Set the number of rows to skip before returning results.
+
+This is useful for pagination.
+
 #### Parameters

 * **offset**: `number`
@@ -335,7 +339,7 @@ offset(offset): this

 #### Inherited from

-[`QueryBase`](QueryBase.md).[`offset`](QueryBase.md#offset)
+`StandardQueryBase.offset`

 ***

@@ -388,7 +392,7 @@ object insertion order is easy to get wrong and `Map` is more foolproof.

 #### Inherited from

-[`QueryBase`](QueryBase.md).[`select`](QueryBase.md#select)
+`StandardQueryBase.select`

 ***

@@ -410,7 +414,7 @@ Collect the results as an array of objects.

 #### Inherited from

-[`QueryBase`](QueryBase.md).[`toArray`](QueryBase.md#toarray)
+`StandardQueryBase.toArray`

 ***

@@ -436,7 +440,7 @@ ArrowTable.

 #### Inherited from

-[`QueryBase`](QueryBase.md).[`toArrow`](QueryBase.md#toarrow)
+`StandardQueryBase.toArrow`

 ***

@@ -471,7 +475,7 @@ on the filter column(s).

 #### Inherited from

-[`QueryBase`](QueryBase.md).[`where`](QueryBase.md#where)
+`StandardQueryBase.where`

 ***

@@ -493,4 +497,4 @@ order to perform hybrid search.

 #### Inherited from

-[`QueryBase`](QueryBase.md).[`withRowId`](QueryBase.md#withrowid)
+`StandardQueryBase.withRowId`
--- a/docs/src/js/classes/QueryBase.md
+++ b/docs/src/js/classes/QueryBase.md
@@ -15,12 +15,11 @@ Common methods supported by all query types

 ## Extended by

- [`Query`](Query.md)
- [`VectorQuery`](VectorQuery.md)
+- [`TakeQuery`](TakeQuery.md)

 ## Type Parameters

-• **NativeQueryType** *extends* `NativeQuery` \| `NativeVectorQuery`
+• **NativeQueryType** *extends* `NativeQuery` \| `NativeVectorQuery` \| `NativeTakeQuery`

 ## Implements

@@ -141,104 +140,6 @@ const plan = await table.query().nearestTo([0.5, 0.2]).explainPlan();

 ***

-### fastSearch()
-
-```ts
-fastSearch(): this
-```
-
-Skip searching un-indexed data. This can make search faster, but will miss
-any data that is not yet indexed.
-
-Use [Table#optimize](Table.md#optimize) to index all un-indexed data.
-
-#### Returns
-
-`this`
-
-***
-
-### ~~filter()~~
-
-```ts
-filter(predicate): this
-```
-
-A filter statement to be applied to this query.
-
-#### Parameters
-
-* **predicate**: `string`
-
-#### Returns
-
-`this`
-
-#### See
-
-where
-
-#### Deprecated
-
-Use `where` instead
-
-***
-
-### fullTextSearch()
-
-```ts
-fullTextSearch(query, options?): this
-```
-
-#### Parameters
-
-* **query**: `string` \| [`FullTextQuery`](../interfaces/FullTextQuery.md)
-
-* **options?**: `Partial`&lt;[`FullTextSearchOptions`](../interfaces/FullTextSearchOptions.md)&gt;
-
-#### Returns
-
-`this`
-
-***
-
-### limit()
-
-```ts
-limit(limit): this
-```
-
-Set the maximum number of results to return.
-
-By default, a plain search has no limit.  If this method is not
-called then every valid row from the table will be returned.
-
-#### Parameters
-
-* **limit**: `number`
-
-#### Returns
-
-`this`
-
-***
-
-### offset()
-
-```ts
-offset(offset): this
-```
-
-#### Parameters
-
-* **offset**: `number`
-
-#### Returns
-
-`this`
-
-***
-
 ### select()

 ```ts
@@ -328,37 +229,6 @@ ArrowTable.

 ***

-### where()
-
-```ts
-where(predicate): this
-```
-
-A filter statement to be applied to this query.
-
-The filter should be supplied as an SQL query string.  For example:
-
-#### Parameters
-
-* **predicate**: `string`
-
-#### Returns
-
-`this`
-
-#### Example
-
-```ts
-x > 10
-y > 0 AND y < 100
-x > 5 OR y = 'test'
-
-Filtering performance can often be improved by creating a scalar index
-on the filter column(s).
-```
-
-***
-
 ### withRowId()

 ```ts
--- a/docs/src/js/classes/Session.md
+++ b/docs/src/js/classes/Session.md
@@ -9,7 +9,8 @@
 A session for managing caches and object stores across LanceDB operations.

 Sessions allow you to configure cache sizes for index and metadata caches,
-which can significantly impact performance for large datasets.
+which can significantly impact memory use and performance. They can
+also be re-used across multiple connections to share the same cache state.

 ## Constructors

@@ -24,8 +25,11 @@ Create a new session with custom cache sizes.
 # Parameters

 - `index_cache_size_bytes`: The size of the index cache in bytes.
+  Index data is stored in memory in this cache to speed up queries.
  Defaults to 6GB if not specified.
 - `metadata_cache_size_bytes`: The size of the metadata cache in bytes.
+  The metadata cache stores file metadata and schema information in memory.
+  This cache improves scan and write performance.
  Defaults to 1GB if not specified.

 #### Parameters
--- a/docs/src/js/classes/Table.md
+++ b/docs/src/js/classes/Table.md
@@ -674,6 +674,48 @@ console.log(tags); // { "v1": { version: 1, manifestSize: ... } }

 ***

+### takeOffsets()
+
+```ts
+abstract takeOffsets(offsets): TakeQuery
+```
+
+Create a query that returns a subset of the rows in the table.
+
+#### Parameters
+
+* **offsets**: `number`[]
+    The offsets of the rows to return.
+
+#### Returns
+
+[`TakeQuery`](TakeQuery.md)
+
+A builder that can be used to parameterize the query.
+
+***
+
+### takeRowIds()
+
+```ts
+abstract takeRowIds(rowIds): TakeQuery
+```
+
+Create a query that returns a subset of the rows in the table.
+
+#### Parameters
+
+* **rowIds**: `number`[]
+    The row ids of the rows to return.
+
+#### Returns
+
+[`TakeQuery`](TakeQuery.md)
+
+A builder that can be used to parameterize the query.
+
+***
+
 ### toArrow()

 ```ts
--- a/docs/src/js/classes/TakeQuery.md
+++ b/docs/src/js/classes/TakeQuery.md
@@ -0,0 +1,265 @@
+[**@lancedb/lancedb**](../README.md) • **Docs**
+
+***
+
+[@lancedb/lancedb](../globals.md) / TakeQuery
+
+# Class: TakeQuery
+
+A query that returns a subset of the rows in the table.
+
+## Extends
+
+- [`QueryBase`](QueryBase.md)&lt;`NativeTakeQuery`&gt;
+
+## Properties
+
+### inner
+
+```ts
+protected inner: TakeQuery | Promise<TakeQuery>;
+```
+
+#### Inherited from
+
+[`QueryBase`](QueryBase.md).[`inner`](QueryBase.md#inner)
+
+## Methods
+
+### analyzePlan()
+
+```ts
+analyzePlan(): Promise<string>
+```
+
+Executes the query and returns the physical query plan annotated with runtime metrics.
+
+This is useful for debugging and performance analysis, as it shows how the query was executed
+and includes metrics such as elapsed time, rows processed, and I/O statistics.
+
+#### Returns
+
+`Promise`&lt;`string`&gt;
+
+A query execution plan with runtime metrics for each step.
+
+#### Example
+
+```ts
+import * as lancedb from "@lancedb/lancedb"
+
+const db = await lancedb.connect("./.lancedb");
+const table = await db.createTable("my_table", [
+  { vector: [1.1, 0.9], id: "1" },
+]);
+
+const plan = await table.query().nearestTo([0.5, 0.2]).analyzePlan();
+
+Example output (with runtime metrics inlined):
+AnalyzeExec verbose=true, metrics=[]
+ ProjectionExec: expr=[id@3 as id, vector@0 as vector, _distance@2 as _distance], metrics=[output_rows=1, elapsed_compute=3.292µs]
+  Take: columns="vector, _rowid, _distance, (id)", metrics=[output_rows=1, elapsed_compute=66.001µs, batches_processed=1, bytes_read=8, iops=1, requests=1]
+   CoalesceBatchesExec: target_batch_size=1024, metrics=[output_rows=1, elapsed_compute=3.333µs]
+    GlobalLimitExec: skip=0, fetch=10, metrics=[output_rows=1, elapsed_compute=167ns]
+     FilterExec: _distance@2 IS NOT NULL, metrics=[output_rows=1, elapsed_compute=8.542µs]
+      SortExec: TopK(fetch=10), expr=[_distance@2 ASC NULLS LAST], metrics=[output_rows=1, elapsed_compute=63.25µs, row_replacements=1]
+       KNNVectorDistance: metric=l2, metrics=[output_rows=1, elapsed_compute=114.333µs, output_batches=1]
+        LanceScan: uri=/path/to/data, projection=[vector], row_id=true, row_addr=false, ordered=false, metrics=[output_rows=1, elapsed_compute=103.626µs, bytes_read=549, iops=2, requests=2]
+```
+
+#### Inherited from
+
+[`QueryBase`](QueryBase.md).[`analyzePlan`](QueryBase.md#analyzeplan)
+
+***
+
+### execute()
+
+```ts
+protected execute(options?): RecordBatchIterator
+```
+
+Execute the query and return the results as an
+
+#### Parameters
+
+* **options?**: `Partial`&lt;[`QueryExecutionOptions`](../interfaces/QueryExecutionOptions.md)&gt;
+
+#### Returns
+
+[`RecordBatchIterator`](RecordBatchIterator.md)
+
+#### See
+
+ - AsyncIterator
+of
+ - RecordBatch.
+
+By default, LanceDb will use many threads to calculate results and, when
+the result set is large, multiple batches will be processed at one time.
+This readahead is limited however and backpressure will be applied if this
+stream is consumed slowly (this constrains the maximum memory used by a
+single query)
+
+#### Inherited from
+
+[`QueryBase`](QueryBase.md).[`execute`](QueryBase.md#execute)
+
+***
+
+### explainPlan()
+
+```ts
+explainPlan(verbose): Promise<string>
+```
+
+Generates an explanation of the query execution plan.
+
+#### Parameters
+
+* **verbose**: `boolean` = `false`
+    If true, provides a more detailed explanation. Defaults to false.
+
+#### Returns
+
+`Promise`&lt;`string`&gt;
+
+A Promise that resolves to a string containing the query execution plan explanation.
+
+#### Example
+
+```ts
+import * as lancedb from "@lancedb/lancedb"
+const db = await lancedb.connect("./.lancedb");
+const table = await db.createTable("my_table", [
+  { vector: [1.1, 0.9], id: "1" },
+]);
+const plan = await table.query().nearestTo([0.5, 0.2]).explainPlan();
+```
+
+#### Inherited from
+
+[`QueryBase`](QueryBase.md).[`explainPlan`](QueryBase.md#explainplan)
+
+***
+
+### select()
+
+```ts
+select(columns): this
+```
+
+Return only the specified columns.
+
+By default a query will return all columns from the table.  However, this can have
+a very significant impact on latency.  LanceDb stores data in a columnar fashion.  This
+means we can finely tune our I/O to select exactly the columns we need.
+
+As a best practice you should always limit queries to the columns that you need.  If you
+pass in an array of column names then only those columns will be returned.
+
+You can also use this method to create new "dynamic" columns based on your existing columns.
+For example, you may not care about "a" or "b" but instead simply want "a + b".  This is often
+seen in the SELECT clause of an SQL query (e.g. `SELECT a+b FROM my_table`).
+
+To create dynamic columns you can pass in a Map<string, string>.  A column will be returned
+for each entry in the map.  The key provides the name of the column.  The value is
+an SQL string used to specify how the column is calculated.
+
+For example, an SQL query might state `SELECT a + b AS combined, c`.  The equivalent
+input to this method would be:
+
+#### Parameters
+
+* **columns**: `string` \| `string`[] \| `Record`&lt;`string`, `string`&gt; \| `Map`&lt;`string`, `string`&gt;
+
+#### Returns
+
+`this`
+
+#### Example
+
+```ts
+new Map([["combined", "a + b"], ["c", "c"]])
+
+Columns will always be returned in the order given, even if that order is different than
+the order used when adding the data.
+
+Note that you can pass in a `Record<string, string>` (e.g. an object literal). This method
+uses `Object.entries` which should preserve the insertion order of the object.  However,
+object insertion order is easy to get wrong and `Map` is more foolproof.
+```
+
+#### Inherited from
+
+[`QueryBase`](QueryBase.md).[`select`](QueryBase.md#select)
+
+***
+
+### toArray()
+
+```ts
+toArray(options?): Promise<any[]>
+```
+
+Collect the results as an array of objects.
+
+#### Parameters
+
+* **options?**: `Partial`&lt;[`QueryExecutionOptions`](../interfaces/QueryExecutionOptions.md)&gt;
+
+#### Returns
+
+`Promise`&lt;`any`[]&gt;
+
+#### Inherited from
+
+[`QueryBase`](QueryBase.md).[`toArray`](QueryBase.md#toarray)
+
+***
+
+### toArrow()
+
+```ts
+toArrow(options?): Promise<Table<any>>
+```
+
+Collect the results as an Arrow
+
+#### Parameters
+
+* **options?**: `Partial`&lt;[`QueryExecutionOptions`](../interfaces/QueryExecutionOptions.md)&gt;
+
+#### Returns
+
+`Promise`&lt;`Table`&lt;`any`&gt;&gt;
+
+#### See
+
+ArrowTable.
+
+#### Inherited from
+
+[`QueryBase`](QueryBase.md).[`toArrow`](QueryBase.md#toarrow)
+
+***
+
+### withRowId()
+
+```ts
+withRowId(): this
+```
+
+Whether to return the row id in the results.
+
+This column can be used to match results between different queries. For
+example, to match results from a full text search and a vector search in
+order to perform hybrid search.
+
+#### Returns
+
+`this`
+
+#### Inherited from
+
+[`QueryBase`](QueryBase.md).[`withRowId`](QueryBase.md#withrowid)
--- a/docs/src/js/classes/VectorQuery.md
+++ b/docs/src/js/classes/VectorQuery.md
@@ -16,7 +16,7 @@ This builder can be reused to execute the query many times.

 ## Extends

- [`QueryBase`](QueryBase.md)&lt;`NativeVectorQuery`&gt;
+- `StandardQueryBase`&lt;`NativeVectorQuery`&gt;

 ## Properties

@@ -28,7 +28,7 @@ protected inner: VectorQuery | Promise<VectorQuery>;

 #### Inherited from

-[`QueryBase`](QueryBase.md).[`inner`](QueryBase.md#inner)
+`StandardQueryBase.inner`

 ## Methods

@@ -91,7 +91,7 @@ AnalyzeExec verbose=true, metrics=[]

 #### Inherited from

-[`QueryBase`](QueryBase.md).[`analyzePlan`](QueryBase.md#analyzeplan)
+`StandardQueryBase.analyzePlan`

 ***

@@ -248,7 +248,7 @@ single query)

 #### Inherited from

-[`QueryBase`](QueryBase.md).[`execute`](QueryBase.md#execute)
+`StandardQueryBase.execute`

 ***

@@ -284,7 +284,7 @@ const plan = await table.query().nearestTo([0.5, 0.2]).explainPlan();

 #### Inherited from

-[`QueryBase`](QueryBase.md).[`explainPlan`](QueryBase.md#explainplan)
+`StandardQueryBase.explainPlan`

 ***

@@ -305,7 +305,7 @@ Use [Table#optimize](Table.md#optimize) to index all un-indexed data.

 #### Inherited from

-[`QueryBase`](QueryBase.md).[`fastSearch`](QueryBase.md#fastsearch)
+`StandardQueryBase.fastSearch`

 ***

@@ -335,7 +335,7 @@ Use `where` instead

 #### Inherited from

-[`QueryBase`](QueryBase.md).[`filter`](QueryBase.md#filter)
+`StandardQueryBase.filter`

 ***

@@ -357,7 +357,7 @@ fullTextSearch(query, options?): this

 #### Inherited from

-[`QueryBase`](QueryBase.md).[`fullTextSearch`](QueryBase.md#fulltextsearch)
+`StandardQueryBase.fullTextSearch`

 ***

@@ -382,7 +382,7 @@ called then every valid row from the table will be returned.

 #### Inherited from

-[`QueryBase`](QueryBase.md).[`limit`](QueryBase.md#limit)
+`StandardQueryBase.limit`

 ***

@@ -480,6 +480,10 @@ the minimum and maximum to the same value.
 offset(offset): this
 ```

+Set the number of rows to skip before returning results.
+
+This is useful for pagination.
+
 #### Parameters

 * **offset**: `number`
@@ -490,7 +494,7 @@ offset(offset): this

 #### Inherited from

-[`QueryBase`](QueryBase.md).[`offset`](QueryBase.md#offset)
+`StandardQueryBase.offset`

 ***

@@ -637,7 +641,7 @@ object insertion order is easy to get wrong and `Map` is more foolproof.

 #### Inherited from

-[`QueryBase`](QueryBase.md).[`select`](QueryBase.md#select)
+`StandardQueryBase.select`

 ***

@@ -659,7 +663,7 @@ Collect the results as an array of objects.

 #### Inherited from

-[`QueryBase`](QueryBase.md).[`toArray`](QueryBase.md#toarray)
+`StandardQueryBase.toArray`

 ***

@@ -685,7 +689,7 @@ ArrowTable.

 #### Inherited from

-[`QueryBase`](QueryBase.md).[`toArrow`](QueryBase.md#toarrow)
+`StandardQueryBase.toArrow`

 ***

@@ -720,7 +724,7 @@ on the filter column(s).

 #### Inherited from

-[`QueryBase`](QueryBase.md).[`where`](QueryBase.md#where)
+`StandardQueryBase.where`

 ***

@@ -742,4 +746,4 @@ order to perform hybrid search.

 #### Inherited from

-[`QueryBase`](QueryBase.md).[`withRowId`](QueryBase.md#withrowid)
+`StandardQueryBase.withRowId`
--- a/docs/src/js/globals.md
+++ b/docs/src/js/globals.md
@@ -33,6 +33,7 @@
 - [Table](classes/Table.md)
 - [TagContents](classes/TagContents.md)
 - [Tags](classes/Tags.md)
+- [TakeQuery](classes/TakeQuery.md)
 - [VectorColumnOptions](classes/VectorColumnOptions.md)
 - [VectorQuery](classes/VectorQuery.md)

--- a/docs/src/js/interfaces/TimeoutConfig.md
+++ b/docs/src/js/interfaces/TimeoutConfig.md
@@ -44,3 +44,17 @@ optional readTimeout: number;
 The timeout for reading data from the server in seconds. Default is 300
 seconds (5 minutes). This can also be set via the environment variable
 `LANCE_CLIENT_READ_TIMEOUT`, as an integer number of seconds.
+
+***
+
+### timeout?
+
+```ts
+optional timeout: number;
+```
+
+The overall timeout for the entire request in seconds. This includes
+connection, send, and read time. If the entire request doesn't complete
+within this time, it will fail. Default is None (no overall timeout).
+This can also be set via the environment variable `LANCE_CLIENT_TIMEOUT`,
+as an integer number of seconds.