mirror of
https://github.com/lancedb/lancedb.git
synced 2025-12-26 22:59:57 +00:00
feat: add take_offsets and take_row_ids (#2584)
These operations have existed in Lance for a long time, and many users have to drop down to Lance for this capability. This PR adds the API and implements it using filters (e.g. `_rowid IN (...)`) so that it doesn't currently add any load to `BaseTable`. I'm not sure that is sustainable, as base table implementations may want to specialize how they handle this method. However, I figure it is a good starting point. In addition, unlike Lance, this API does not currently guarantee anything about the order of the take results. This is necessary for the fallback filter approach to work (SQL filters cannot guarantee result order).
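A minimal sketch of the filter-based fallback described above, and of how a caller who needs results in request order could restore it afterwards. Both helper names (`row_id_filter`, `reorder_like_request`) are hypothetical and not part of the LanceDB API, and plain dicts stand in for the rows a query would return. The only thing taken from this PR is the `_rowid IN (...)` predicate shape and the fact that result order is unspecified.

```python
from typing import Any, Dict, Iterable, List, Sequence


def row_id_filter(row_ids: Iterable[int]) -> str:
    """Build a SQL predicate of the form `_rowid IN (...)` (hypothetical helper)."""
    ids = [int(r) for r in row_ids]
    if not ids:
        # `IN ()` is not valid SQL, so fall back to a predicate that matches nothing.
        return "false"
    return "_rowid IN ({})".format(", ".join(str(i) for i in ids))


def reorder_like_request(
    rows: Sequence[Dict[str, Any]],
    requested: Sequence[int],
    key: str = "_rowid",
) -> List[Dict[str, Any]]:
    """Return `rows` in the order of `requested`, skipping ids that were not found.

    A filter gives no ordering guarantee, so the caller re-sorts by the
    requested ids if order matters (hypothetical helper).
    """
    by_id = {row[key]: row for row in rows}
    return [by_id[r] for r in requested if r in by_id]


# Stand-in for an unordered result set produced by the filter.
unordered = [
    {"_rowid": 2, "idx": 2},
    {"_rowid": 17, "idx": 17},
    {"_rowid": 5, "idx": 5},
]
predicate = row_id_filter([5, 2, 17])
ordered = reorder_like_request(unordered, [5, 2, 17])
```

The test added in this commit sidesteps the ordering question by sorting the returned `idx` values before comparing them.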
@@ -1327,6 +1327,34 @@ def test_query_timeout(tmp_path):
         )


+def test_take_queries(tmp_path):
+    db = lancedb.connect(tmp_path)
+    data = pa.table(
+        {
+            "idx": range(100),
+        }
+    )
+    table = db.create_table("test", data)
+
+    # Take by offset
+    assert list(
+        sorted(table.take_offsets([5, 2, 17]).to_pandas()["idx"].to_list())
+    ) == [
+        2,
+        5,
+        17,
+    ]
+
+    # Take by row id
+    assert list(
+        sorted(table.take_row_ids([5, 2, 17]).to_pandas()["idx"].to_list())
+    ) == [
+        2,
+        5,
+        17,
+    ]
+
+
 @pytest.mark.asyncio
 async def test_query_timeout_async(tmp_path):
     db = await lancedb.connect_async(tmp_path)