Compare commits


23 Commits

Author SHA1 Message Date
ayush chaurasia
887ac0d79d Merge branch 'main' of https://github.com/lancedb/lancedb into hybrid_query 2024-03-01 11:14:06 +05:30
QianZhu
c1af53b787 Add create scalar index to sdk (#1033) 2024-02-29 13:32:01 -08:00
Weston Pace
2a02d1394b feat: port create_table to the async python API and the remote rust API (#1031)
I've also started `ASYNC_MIGRATION.md` to keep track of the breaking
changes from sync to async Python.
2024-02-29 13:29:29 -08:00
Lance Release
085066d2a8 [python] Bump version: 0.6.0 → 0.6.1 2024-02-29 19:48:16 +00:00
Rob Meng
adf1a38f4d fix: fix columns type for pydantic 2.x (#1045) 2024-02-29 14:47:56 -05:00
Weston Pace
294c33a42e feat: Initial remote table implementation for rust (#1024)
This will eventually replace the remote table implementations in python
and node.
2024-02-29 10:55:49 -08:00
Lance Release
245786fed7 [python] Bump version: 0.5.7 → 0.6.0 2024-02-29 16:03:01 +00:00
BubbleCal
edd9a043f8 chore: enable test for dropping table (#1038)
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
2024-02-29 15:00:24 +08:00
natcharacter
38c09fc294 A simple base usage that install the dependencies necessary to use FT… (#1036)
A simple base image that installs the dependencies necessary to use FTS
and Hybrid search

---------

Co-authored-by: Nat Roth <natroth@Nats-MacBook-Pro.local>
Co-authored-by: Chang She <759245+changhiskhan@users.noreply.github.com>
2024-02-28 09:38:05 -08:00
Rob Meng
ebaa2dede5 chore: upgrade to lance 0.10.1 (#1034)
upgrade to lance 0.10.1 and update doc string to reflect dynamic
projection options
2024-02-28 11:06:46 -05:00
BubbleCal
ba7618a026 chore(rust): report the TableNotFound error while dropping non-exist table (#1022)
This will work after upgrading lance, once
https://github.com/lancedb/lance/pull/1995 is merged.
See #884 for details.

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
2024-02-28 04:46:39 -08:00
Weston Pace
a6bcbd007b feat: add a basic async python client starting point (#1014)
This changes `lancedb` from a "pure python" setuptools project to a
maturin project and adds a rust lancedb dependency.

The async python client is extremely minimal (only `connect` and
`Connection.table_names` are supported). The purpose of this PR is to
get the infrastructure in place for building out the rest of the async
client.

Although this is not technically a breaking change (no APIs are
changing) it is still a considerable change in the way the wheels are
built because they now include the native shared library.
2024-02-27 04:52:02 -08:00
Will Jones
5af74b5aca feat: {add|alter|drop}_columns APIs (#1015)
Initial work for #959. This exposes the basic functionality for each in
all of the APIs. Will add user guide documentation in a later PR.
2024-02-26 11:04:53 -08:00
Weston Pace
8a52619bc0 refactor: change arrow from a direct dependency to a peer dependency (#984)
BREAKING CHANGE: users will now need to npm install `apache-arrow` and
`@apache-arrow/ts` themselves.
2024-02-23 14:08:39 -08:00
ayush chaurasia
3ad4992282 update 2024-02-23 14:11:58 +05:30
ayush chaurasia
51cc422799 update 2024-02-23 14:03:21 +05:30
ayush chaurasia
a696dbc8f4 update 2024-02-23 13:54:44 +05:30
Lance Release
314d4c93e5 Updating package-lock.json 2024-02-23 05:11:22 +00:00
Lance Release
c5471ee694 Updating package-lock.json 2024-02-23 03:57:39 +00:00
ayush chaurasia
9ca0260d54 update 2024-02-23 03:03:39 +05:30
ayush chaurasia
6486ec870b update 2024-02-23 03:02:05 +05:30
ayush chaurasia
64db2393f7 update 2024-02-22 16:28:17 +05:30
ayush chaurasia
bd4e8341fe update 2024-02-21 21:43:23 +05:30
118 changed files with 3846 additions and 347 deletions


@@ -0,0 +1,58 @@
# We create a composite action to be re-used both for testing and for releasing
name: build-linux-wheel
description: "Build a manylinux wheel for lance"
inputs:
python-minor-version:
description: "8, 9, 10, 11, 12"
required: true
args:
description: "--release"
required: false
default: ""
arm-build:
description: "Build for arm64 instead of x86_64"
# Note: this does *not* mean the host is arm64, since we might be cross-compiling.
required: false
default: "false"
runs:
using: "composite"
steps:
- name: CONFIRM ARM BUILD
shell: bash
run: |
echo "ARM BUILD: ${{ inputs.arm-build }}"
- name: Build x86_64 Manylinux wheel
if: ${{ inputs.arm-build == 'false' }}
uses: PyO3/maturin-action@v1
with:
command: build
working-directory: python
target: x86_64-unknown-linux-gnu
manylinux: "2_17"
args: ${{ inputs.args }}
before-script-linux: |
set -e
yum install -y openssl-devel \
&& curl -L https://github.com/protocolbuffers/protobuf/releases/download/v24.4/protoc-24.4-linux-$(uname -m).zip > /tmp/protoc.zip \
&& unzip /tmp/protoc.zip -d /usr/local \
&& rm /tmp/protoc.zip
- name: Build Arm Manylinux Wheel
if: ${{ inputs.arm-build == 'true' }}
uses: PyO3/maturin-action@v1
with:
command: build
working-directory: python
target: aarch64-unknown-linux-gnu
manylinux: "2_24"
args: ${{ inputs.args }}
before-script-linux: |
set -e
apt install -y unzip
if [ $(uname -m) = "x86_64" ]; then
PROTOC_ARCH="x86_64"
else
PROTOC_ARCH="aarch_64"
fi
curl -L https://github.com/protocolbuffers/protobuf/releases/download/v24.4/protoc-24.4-linux-$PROTOC_ARCH.zip > /tmp/protoc.zip \
&& unzip /tmp/protoc.zip -d /usr/local \
&& rm /tmp/protoc.zip


@@ -0,0 +1,25 @@
# We create a composite action to be re-used both for testing and for releasing
name: build_wheel
description: "Build a lance wheel"
inputs:
python-minor-version:
description: "8, 9, 10, 11"
required: true
args:
description: "--release"
required: false
default: ""
runs:
using: "composite"
steps:
- name: Install macos dependency
shell: bash
run: |
brew install protobuf
- name: Build wheel
uses: PyO3/maturin-action@v1
with:
command: build
args: ${{ inputs.args }}
working-directory: python
interpreter: 3.${{ inputs.python-minor-version }}


@@ -0,0 +1,33 @@
# We create a composite action to be re-used both for testing and for releasing
name: build_wheel
description: "Build a lance wheel"
inputs:
python-minor-version:
description: "8, 9, 10, 11"
required: true
args:
description: "--release"
required: false
default: ""
runs:
using: "composite"
steps:
- name: Install Protoc v21.12
working-directory: C:\
run: |
New-Item -Path 'C:\protoc' -ItemType Directory
Set-Location C:\protoc
Invoke-WebRequest https://github.com/protocolbuffers/protobuf/releases/download/v21.12/protoc-21.12-win64.zip -OutFile C:\protoc\protoc.zip
7z x protoc.zip
Add-Content $env:GITHUB_PATH "C:\protoc\bin"
shell: powershell
- name: Build wheel
uses: PyO3/maturin-action@v1
with:
command: build
args: ${{ inputs.args }}
working-directory: python
- uses: actions/upload-artifact@v3
with:
name: windows-wheels
path: python\target\wheels


@@ -2,30 +2,91 @@ name: PyPI Publish
on:
release:
types: [ published ]
types: [published]
jobs:
publish:
runs-on: ubuntu-latest
# Only runs on tags that match the python-make-release action
if: startsWith(github.ref, 'refs/tags/python-v')
defaults:
run:
shell: bash
working-directory: python
linux:
timeout-minutes: 60
strategy:
matrix:
python-minor-version: ["8"]
platform:
- x86_64
- aarch64
runs-on: "ubuntu-22.04"
steps:
- uses: actions/checkout@v4
with:
fetch-depth: 0
lfs: true
- name: Set up Python
uses: actions/setup-python@v5
uses: actions/setup-python@v4
with:
python-version: "3.8"
- name: Build distribution
run: |
ls -la
pip install wheel setuptools --upgrade
python setup.py sdist bdist_wheel
- name: Publish
uses: pypa/gh-action-pypi-publish@v1.8.5
python-version: 3.${{ matrix.python-minor-version }}
- uses: ./.github/workflows/build_linux_wheel
with:
password: ${{ secrets.LANCEDB_PYPI_API_TOKEN }}
packages-dir: python/dist
python-minor-version: ${{ matrix.python-minor-version }}
args: "--release --strip"
arm-build: ${{ matrix.platform == 'aarch64' }}
- uses: ./.github/workflows/upload_wheel
with:
token: ${{ secrets.LANCEDB_PYPI_API_TOKEN }}
repo: "pypi"
mac:
timeout-minutes: 60
runs-on: ${{ matrix.config.runner }}
strategy:
matrix:
python-minor-version: ["8"]
config:
- target: x86_64-apple-darwin
runner: macos-13
- target: aarch64-apple-darwin
runner: macos-14
env:
MACOSX_DEPLOYMENT_TARGET: 10.15
steps:
- uses: actions/checkout@v4
with:
ref: ${{ inputs.ref }}
fetch-depth: 0
lfs: true
- name: Set up Python
uses: actions/setup-python@v4
with:
python-version: 3.12
- uses: ./.github/workflows/build_mac_wheel
with:
python-minor-version: ${{ matrix.python-minor-version }}
args: "--release --strip --target ${{ matrix.config.target }}"
- uses: ./.github/workflows/upload_wheel
with:
python-minor-version: ${{ matrix.python-minor-version }}
token: ${{ secrets.LANCEDB_PYPI_API_TOKEN }}
repo: "pypi"
windows:
timeout-minutes: 60
runs-on: windows-latest
strategy:
matrix:
python-minor-version: ["8"]
steps:
- uses: actions/checkout@v4
with:
ref: ${{ inputs.ref }}
fetch-depth: 0
lfs: true
- name: Set up Python
uses: actions/setup-python@v4
with:
python-version: 3.${{ matrix.python-minor-version }}
- uses: ./.github/workflows/build_windows_wheel
with:
python-minor-version: ${{ matrix.python-minor-version }}
args: "--release --strip"
vcpkg_token: ${{ secrets.VCPKG_GITHUB_PACKAGES }}
- uses: ./.github/workflows/upload_wheel
with:
python-minor-version: ${{ matrix.python-minor-version }}
token: ${{ secrets.LANCEDB_PYPI_API_TOKEN }}
repo: "pypi"


@@ -14,49 +14,133 @@ concurrency:
cancel-in-progress: true
jobs:
linux:
lint:
name: "Lint"
timeout-minutes: 30
strategy:
matrix:
python-minor-version: [ "8", "11" ]
runs-on: "ubuntu-22.04"
defaults:
run:
shell: bash
working-directory: python
steps:
- uses: actions/checkout@v4
with:
fetch-depth: 0
lfs: true
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: 3.${{ matrix.python-minor-version }}
- name: Install lancedb
run: |
pip install -e .[tests]
pip install tantivy@git+https://github.com/quickwit-oss/tantivy-py#164adc87e1a033117001cf70e38c82a53014d985
pip install pytest pytest-mock ruff
- name: Format check
run: ruff format --check .
- name: Lint
run: ruff .
- name: Run tests
run: pytest -m "not slow" -x -v --durations=30 tests
- name: doctest
run: pytest --doctest-modules lancedb
- uses: actions/checkout@v4
with:
fetch-depth: 0
lfs: true
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: "3.11"
- name: Install ruff
run: |
pip install ruff==0.2.2
- name: Format check
run: ruff format --check .
- name: Lint
run: ruff .
doctest:
name: "Doctest"
timeout-minutes: 30
runs-on: "ubuntu-22.04"
defaults:
run:
shell: bash
working-directory: python
steps:
- uses: actions/checkout@v4
with:
fetch-depth: 0
lfs: true
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: "3.11"
cache: "pip"
- name: Install protobuf
run: |
sudo apt update
sudo apt install -y protobuf-compiler
- uses: Swatinem/rust-cache@v2
with:
workspaces: python
- name: Install
run: |
pip install -e .[tests,dev,embeddings]
pip install tantivy@git+https://github.com/quickwit-oss/tantivy-py#164adc87e1a033117001cf70e38c82a53014d985
pip install mlx
- name: Doctest
run: pytest --doctest-modules python/lancedb
linux:
name: "Linux: python-3.${{ matrix.python-minor-version }}"
timeout-minutes: 30
strategy:
matrix:
python-minor-version: ["8", "11"]
runs-on: "ubuntu-22.04"
defaults:
run:
shell: bash
working-directory: python
steps:
- uses: actions/checkout@v4
with:
fetch-depth: 0
lfs: true
- name: Install protobuf
run: |
sudo apt update
sudo apt install -y protobuf-compiler
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: 3.${{ matrix.python-minor-version }}
- uses: Swatinem/rust-cache@v2
with:
workspaces: python
- uses: ./.github/workflows/build_linux_wheel
- uses: ./.github/workflows/run_tests
# Make sure wheels are not included in the Rust cache
- name: Delete wheels
run: rm -rf target/wheels
platform:
name: "Platform: ${{ matrix.config.name }}"
name: "Mac: ${{ matrix.config.name }}"
timeout-minutes: 30
strategy:
matrix:
config:
- name: x86 Mac
- name: x86
runner: macos-13
- name: Arm Mac
- name: Arm
runner: macos-14
- name: x86 Windows
runs-on: "${{ matrix.config.runner }}"
defaults:
run:
shell: bash
working-directory: python
steps:
- uses: actions/checkout@v4
with:
fetch-depth: 0
lfs: true
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: "3.11"
- uses: Swatinem/rust-cache@v2
with:
workspaces: python
- uses: ./.github/workflows/build_mac_wheel
- uses: ./.github/workflows/run_tests
# Make sure wheels are not included in the Rust cache
- name: Delete wheels
run: rm -rf target/wheels
windows:
name: "Windows: ${{ matrix.config.name }}"
timeout-minutes: 30
strategy:
matrix:
config:
- name: x86
runner: windows-latest
runs-on: "${{ matrix.config.runner }}"
defaults:
@@ -64,21 +148,22 @@ jobs:
shell: bash
working-directory: python
steps:
- uses: actions/checkout@v4
with:
fetch-depth: 0
lfs: true
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: "3.11"
- name: Install lancedb
run: |
pip install -e .[tests]
pip install tantivy@git+https://github.com/quickwit-oss/tantivy-py#164adc87e1a033117001cf70e38c82a53014d985
pip install pytest pytest-mock
- name: Run tests
run: pytest -m "not slow" -x -v --durations=30 tests
- uses: actions/checkout@v4
with:
fetch-depth: 0
lfs: true
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: "3.11"
- uses: Swatinem/rust-cache@v2
with:
workspaces: python
- uses: ./.github/workflows/build_windows_wheel
- uses: ./.github/workflows/run_tests
# Make sure wheels are not included in the Rust cache
- name: Delete wheels
run: rm -rf target/wheels
pydantic1x:
timeout-minutes: 30
runs-on: "ubuntu-22.04"
@@ -87,21 +172,22 @@ jobs:
shell: bash
working-directory: python
steps:
- uses: actions/checkout@v4
with:
fetch-depth: 0
lfs: true
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: 3.9
- name: Install lancedb
run: |
pip install "pydantic<2"
pip install -e .[tests]
pip install tantivy@git+https://github.com/quickwit-oss/tantivy-py#164adc87e1a033117001cf70e38c82a53014d985
pip install pytest pytest-mock
- name: Run tests
run: pytest -m "not slow" -x -v --durations=30 tests
- name: doctest
run: pytest --doctest-modules lancedb
- uses: actions/checkout@v4
with:
fetch-depth: 0
lfs: true
- name: Install dependencies
run: |
sudo apt update
sudo apt install -y protobuf-compiler
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: 3.9
- name: Install lancedb
run: |
pip install "pydantic<2"
pip install -e .[tests]
pip install tantivy@git+https://github.com/quickwit-oss/tantivy-py#164adc87e1a033117001cf70e38c82a53014d985
- name: Run tests
run: pytest -m "not slow" -x -v --durations=30 python/tests


@@ -0,0 +1,37 @@
name: LanceDb Cloud Integration Test
on:
workflow_run:
workflows: [Rust]
types:
- completed
env:
LANCEDB_PROJECT: ${{ secrets.LANCEDB_PROJECT }}
LANCEDB_API_KEY: ${{ secrets.LANCEDB_API_KEY }}
LANCEDB_REGION: ${{ secrets.LANCEDB_REGION }}
jobs:
test:
timeout-minutes: 30
runs-on: ubuntu-22.04
defaults:
run:
shell: bash
working-directory: rust
steps:
- uses: actions/checkout@v4
with:
fetch-depth: 0
lfs: true
- uses: Swatinem/rust-cache@v2
with:
workspaces: rust
- name: Install dependencies
run: |
sudo apt update
sudo apt install -y protobuf-compiler libssl-dev
- name: Build
run: cargo build --all-features
- name: Run Integration test
run: cargo test --tests -- --ignored

.github/workflows/run_tests/action.yml

@@ -0,0 +1,17 @@
name: run-tests
description: "Install lance wheel and run unit tests"
inputs:
python-minor-version:
required: true
description: "8 9 10 11 12"
runs:
using: "composite"
steps:
- name: Install lancedb
shell: bash
run: |
pip3 install $(ls target/wheels/lancedb-*.whl)[tests,dev]
- name: pytest
shell: bash
run: pytest -m "not slow" -x -v --durations=30 python/python/tests


@@ -119,3 +119,4 @@ jobs:
$env:VCPKG_ROOT = $env:VCPKG_INSTALLATION_ROOT
cargo build
cargo test


@@ -0,0 +1,29 @@
name: upload-wheel
description: "Upload wheels to Pypi"
inputs:
os:
required: true
description: "ubuntu-22.04 or macos-13"
repo:
required: false
description: "pypi or testpypi"
default: "pypi"
token:
required: true
description: "release token for the repo"
runs:
using: "composite"
steps:
- name: Install dependencies
shell: bash
run: |
python -m pip install --upgrade pip
pip install twine
- name: Publish wheel
env:
TWINE_USERNAME: __token__
TWINE_PASSWORD: ${{ inputs.token }}
shell: bash
run: twine upload --repository ${{ inputs.repo }} target/wheels/lancedb-*.whl

.gitignore

@@ -22,6 +22,11 @@ python/dist
**/.hypothesis
# Compiled Dynamic libraries
*.so
*.dylib
*.dll
## Javascript
*.node
**/node_modules
@@ -34,4 +39,6 @@ dist
## Rust
target
**/sccache.log
Cargo.lock


@@ -5,17 +5,8 @@ repos:
- id: check-yaml
- id: end-of-file-fixer
- id: trailing-whitespace
- repo: https://github.com/psf/black
rev: 22.12.0
hooks:
- id: black
- repo: https://github.com/astral-sh/ruff-pre-commit
# Ruff version.
rev: v0.0.277
rev: v0.2.2
hooks:
- id: ruff
- repo: https://github.com/pycqa/isort
rev: 5.12.0
hooks:
- id: isort
name: isort (python)


@@ -1,5 +1,5 @@
[workspace]
members = ["rust/ffi/node", "rust/lancedb", "nodejs"]
members = ["rust/ffi/node", "rust/lancedb", "nodejs", "python"]
# Python package needs to be built by maturin.
exclude = ["python"]
resolver = "2"
@@ -14,10 +14,10 @@ keywords = ["lancedb", "lance", "database", "vector", "search"]
categories = ["database-implementations"]
[workspace.dependencies]
lance = { "version" = "=0.9.18", "features" = ["dynamodb"] }
lance-index = { "version" = "=0.9.18" }
lance-linalg = { "version" = "=0.9.18" }
lance-testing = { "version" = "=0.9.18" }
lance = { "version" = "=0.10.1", "features" = ["dynamodb"] }
lance-index = { "version" = "=0.10.1" }
lance-linalg = { "version" = "=0.10.1" }
lance-testing = { "version" = "=0.10.1" }
# Note that this one does not include pyarrow
arrow = { version = "50.0", optional = false }
arrow-array = "50.0"

dockerfiles/Dockerfile

@@ -0,0 +1,27 @@
#Simple base dockerfile that supports basic dependencies required to run lance with FTS and Hybrid Search
#Usage docker build -t lancedb:latest -f Dockerfile .
FROM python:3.10-slim-buster
# Install Rust
RUN apt-get update && apt-get install -y curl build-essential && \
curl https://sh.rustup.rs -sSf | sh -s -- -y
# Set the environment variable for Rust
ENV PATH="/root/.cargo/bin:${PATH}"
# Install protobuf compiler
RUN apt-get install -y protobuf-compiler && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
RUN apt-get -y update &&\
apt-get -y upgrade && \
apt-get -y install git
# Verify installations
RUN python --version && \
rustc --version && \
protoc --version
RUN pip install tantivy lancedb

node/package-lock.json

@@ -1,12 +1,12 @@
{
"name": "vectordb",
"version": "0.4.10",
"version": "0.4.11",
"lockfileVersion": 3,
"requires": true,
"packages": {
"": {
"name": "vectordb",
"version": "0.4.10",
"version": "0.4.11",
"cpu": [
"x64",
"arm64"
@@ -53,11 +53,11 @@
"uuid": "^9.0.0"
},
"optionalDependencies": {
"@lancedb/vectordb-darwin-arm64": "0.4.10",
"@lancedb/vectordb-darwin-x64": "0.4.10",
"@lancedb/vectordb-linux-arm64-gnu": "0.4.10",
"@lancedb/vectordb-linux-x64-gnu": "0.4.10",
"@lancedb/vectordb-win32-x64-msvc": "0.4.10"
"@lancedb/vectordb-darwin-arm64": "0.4.11",
"@lancedb/vectordb-darwin-x64": "0.4.11",
"@lancedb/vectordb-linux-arm64-gnu": "0.4.11",
"@lancedb/vectordb-linux-x64-gnu": "0.4.11",
"@lancedb/vectordb-win32-x64-msvc": "0.4.11"
}
},
"node_modules/@75lb/deep-merge": {
@@ -329,9 +329,9 @@
}
},
"node_modules/@lancedb/vectordb-darwin-arm64": {
"version": "0.4.10",
"resolved": "https://registry.npmjs.org/@lancedb/vectordb-darwin-arm64/-/vectordb-darwin-arm64-0.4.10.tgz",
"integrity": "sha512-y/uHOGb0g15pvqv5tdTyZ6oN+0QVpBmZDzKFWW6pPbuSZjB2uPqcs+ti0RB+AUdmS21kavVQqaNsw/HLKEGrHA==",
"version": "0.4.11",
"resolved": "https://registry.npmjs.org/@lancedb/vectordb-darwin-arm64/-/vectordb-darwin-arm64-0.4.11.tgz",
"integrity": "sha512-JDOKmFnuJPFkA7ZmrzBJolROwSjWr7yMvAbi40uLBc25YbbVezodd30u2EFtIwWwtk1GqNYRZ49FZOElKYeC/Q==",
"cpu": [
"arm64"
],
@@ -341,9 +341,9 @@
]
},
"node_modules/@lancedb/vectordb-darwin-x64": {
"version": "0.4.10",
"resolved": "https://registry.npmjs.org/@lancedb/vectordb-darwin-x64/-/vectordb-darwin-x64-0.4.10.tgz",
"integrity": "sha512-XbfR58OkQpAe0xMSTrwJh9ZjGSzG9EZ7zwO6HfYem8PxcLYAcC6eWRWoSG/T0uObyrPTcYYyvHsp0eNQWYBFAQ==",
"version": "0.4.11",
"resolved": "https://registry.npmjs.org/@lancedb/vectordb-darwin-x64/-/vectordb-darwin-x64-0.4.11.tgz",
"integrity": "sha512-iy6r+8tp2v1EFgJV52jusXtxgO6NY6SkpOdX41xPqN2mQWMkfUAR9Xtks1mgknjPOIKH4MRc8ZS0jcW/UWmilQ==",
"cpu": [
"x64"
],
@@ -353,9 +353,9 @@
]
},
"node_modules/@lancedb/vectordb-linux-arm64-gnu": {
"version": "0.4.10",
"resolved": "https://registry.npmjs.org/@lancedb/vectordb-linux-arm64-gnu/-/vectordb-linux-arm64-gnu-0.4.10.tgz",
"integrity": "sha512-x40WKH9b+KxorRmKr9G7fv8p5mMj8QJQvRMA0v6v+nbZHr2FLlAZV+9mvhHOnm4AGIkPP5335cUgv6Qz6hgwkQ==",
"version": "0.4.11",
"resolved": "https://registry.npmjs.org/@lancedb/vectordb-linux-arm64-gnu/-/vectordb-linux-arm64-gnu-0.4.11.tgz",
"integrity": "sha512-5K6IVcTMuH0SZBjlqB5Gg39WC889FpTwIWKufxzQMMXrzxo5J3lKUHVoR28RRlNhDF2d9kZXBEyCpIfDFsV9iQ==",
"cpu": [
"arm64"
],
@@ -365,9 +365,9 @@
]
},
"node_modules/@lancedb/vectordb-linux-x64-gnu": {
"version": "0.4.10",
"resolved": "https://registry.npmjs.org/@lancedb/vectordb-linux-x64-gnu/-/vectordb-linux-x64-gnu-0.4.10.tgz",
"integrity": "sha512-CTGPpuzlqq2nVjUxI9gAJOT1oBANIovtIaFsOmBSnEAHgX7oeAxKy2b6L/kJzsgqSzvR5vfLwYcWFrr6ZmBxSA==",
"version": "0.4.11",
"resolved": "https://registry.npmjs.org/@lancedb/vectordb-linux-x64-gnu/-/vectordb-linux-x64-gnu-0.4.11.tgz",
"integrity": "sha512-hF9ZChsdqKqqnivOzd9mE7lC3PmhZadXtwThi2RrsPiOLoEaGDfmr6Ni3amVQnB3bR8YEJtTxdQxe0NC4uW/8g==",
"cpu": [
"x64"
],
@@ -377,9 +377,9 @@
]
},
"node_modules/@lancedb/vectordb-win32-x64-msvc": {
"version": "0.4.10",
"resolved": "https://registry.npmjs.org/@lancedb/vectordb-win32-x64-msvc/-/vectordb-win32-x64-msvc-0.4.10.tgz",
"integrity": "sha512-Fd7r74coZyrKzkfXg4WthqOL+uKyJyPTia6imcrMNqKOlTGdKmHf02Qi2QxWZrFaabkRYo4Tpn5FeRJ3yYX8CA==",
"version": "0.4.11",
"resolved": "https://registry.npmjs.org/@lancedb/vectordb-win32-x64-msvc/-/vectordb-win32-x64-msvc-0.4.11.tgz",
"integrity": "sha512-0+9ut1ccKoqIyGxsVixwx3771Z+DXpl5WfSmOeA8kf3v3jlOg2H+0YUahiXLDid2ju+yeLPrAUYm7A1gKHVhew==",
"cpu": [
"x64"
],


@@ -61,11 +61,13 @@
"uuid": "^9.0.0"
},
"dependencies": {
"@apache-arrow/ts": "^14.0.2",
"@neon-rs/load": "^0.0.74",
"apache-arrow": "^14.0.2",
"axios": "^1.4.0"
},
"peerDependencies": {
"@apache-arrow/ts": "^14.0.2",
"apache-arrow": "^14.0.2"
},
"os": [
"darwin",
"linux",


@@ -42,7 +42,10 @@ const {
tableCompactFiles,
tableListIndices,
tableIndexStats,
tableSchema
tableSchema,
tableAddColumns,
tableAlterColumns,
tableDropColumns
// eslint-disable-next-line @typescript-eslint/no-var-requires
} = require('../native.js')
@@ -338,6 +341,7 @@ export interface Table<T = number[]> {
*
* @param column The column to index
* @param replace If false, fail if an index already exists on the column.
* Note that it is always set to true for remote connections.
*
* Scalar indices, like vector indices, can be used to speed up scans. A scalar
* index can speed up scans that contain filter expressions on the indexed column.
@@ -381,7 +385,7 @@ export interface Table<T = number[]> {
* await table.createScalarIndex('my_col')
* ```
*/
createScalarIndex: (column: string, replace: boolean) => Promise<void>
createScalarIndex: (column: string, replace?: boolean) => Promise<void>
/**
* Returns the number of rows in this table.
@@ -500,6 +504,59 @@ export interface Table<T = number[]> {
filter(value: string): Query<T>
schema: Promise<Schema>
// TODO: Support BatchUDF
/**
* Add new columns with defined values.
*
* @param newColumnTransforms pairs of column names and the SQL expression to use
* to calculate the value of the new column. These
* expressions will be evaluated for each row in the
* table, and can reference existing columns in the table.
*/
addColumns(newColumnTransforms: Array<{ name: string, valueSql: string }>): Promise<void>
/**
* Alter the name or nullability of columns.
*
* @param columnAlterations One or more alterations to apply to columns.
*/
alterColumns(columnAlterations: ColumnAlteration[]): Promise<void>
/**
* Drop one or more columns from the dataset
*
* This is a metadata-only operation and does not remove the data from the
* underlying storage. In order to remove the data, you must subsequently
* call ``compact_files`` to rewrite the data without the removed columns and
* then call ``cleanup_files`` to remove the old files.
*
* @param columnNames The names of the columns to drop. These can be nested
* column references (e.g. "a.b.c") or top-level column
* names (e.g. "a").
*/
dropColumns(columnNames: string[]): Promise<void>
}
/**
* A definition of a column alteration. The alteration changes the column at
* `path` to have the new name `name`, to be nullable if `nullable` is true,
* and to have the data type `data_type`. At least one of `rename` or `nullable`
* must be provided.
*/
export interface ColumnAlteration {
/**
* The path to the column to alter. This is a dot-separated path to the column.
* If it is a top-level column then it is just the name of the column. If it is
* a nested column then it is the path to the column, e.g. "a.b.c" for a column
* `c` nested inside a column `b` nested inside a column `a`.
*/
path: string
rename?: string
/**
* Set the new nullability. Note that a nullable column cannot be made non-nullable.
*/
nullable?: boolean
}
export interface UpdateArgs {
@@ -858,7 +915,10 @@ export class LocalTable<T = number[]> implements Table<T> {
})
}
async createScalarIndex (column: string, replace: boolean): Promise<void> {
async createScalarIndex (column: string, replace?: boolean): Promise<void> {
if (replace === undefined) {
replace = true
}
return tableCreateScalarIndex.call(this._tbl, column, replace)
}
@@ -1028,6 +1088,18 @@ export class LocalTable<T = number[]> implements Table<T> {
return false
}
}
async addColumns (newColumnTransforms: Array<{ name: string, valueSql: string }>): Promise<void> {
return tableAddColumns.call(this._tbl, newColumnTransforms)
}
async alterColumns (columnAlterations: ColumnAlteration[]): Promise<void> {
return tableAlterColumns.call(this._tbl, columnAlterations)
}
async dropColumns (columnNames: string[]): Promise<void> {
return tableDropColumns.call(this._tbl, columnNames)
}
}
export interface CleanupStats {


@@ -25,7 +25,8 @@ import {
type UpdateArgs,
type UpdateSqlArgs,
makeArrowTable,
type MergeInsertArgs
type MergeInsertArgs,
type ColumnAlteration
} from '../index'
import { Query } from '../query'
@@ -396,7 +397,7 @@ export class RemoteTable<T = number[]> implements Table<T> {
}
const column = indexParams.column ?? 'vector'
const indexType = 'vector' // only vector index is supported for remote connections
const indexType = 'vector'
const metricType = indexParams.metric_type ?? 'L2'
const indexCacheSize = indexParams.index_cache_size ?? null
@@ -419,8 +420,25 @@ export class RemoteTable<T = number[]> implements Table<T> {
}
}
async createScalarIndex (column: string, replace: boolean): Promise<void> {
throw new Error('Not implemented')
async createScalarIndex (column: string): Promise<void> {
const indexType = 'scalar'
const data = {
column,
index_type: indexType,
replace: true
}
const res = await this._client.post(
`/v1/table/${this._name}/create_scalar_index/`,
data
)
if (res.status !== 200) {
throw new Error(
`Server Error, status: ${res.status}, ` +
// eslint-disable-next-line @typescript-eslint/restrict-template-expressions
`message: ${res.statusText}: ${res.data}`
)
}
}
async countRows (): Promise<number> {
@@ -474,4 +492,16 @@ export class RemoteTable<T = number[]> implements Table<T> {
numUnindexedRows: results.data.num_unindexed_rows
}
}
async addColumns (newColumnTransforms: Array<{ name: string, valueSql: string }>): Promise<void> {
throw new Error('Add columns is not yet supported in LanceDB Cloud.')
}
async alterColumns (columnAlterations: ColumnAlteration[]): Promise<void> {
throw new Error('Alter columns is not yet supported in LanceDB Cloud.')
}
async dropColumns (columnNames: string[]): Promise<void> {
throw new Error('Drop columns is not yet supported in LanceDB Cloud.')
}
}


@@ -37,8 +37,10 @@ import {
Utf8,
Table as ArrowTable,
vectorFromArray,
Float64,
Float32,
Float16
Float16,
Int64
} from 'apache-arrow'
const expect = chai.expect
@@ -196,7 +198,7 @@ describe('LanceDB client', function () {
const table = await con.openTable('vectors')
const results = await table
.search([0.1, 0.1])
.select(['is_active'])
.select(['is_active', 'vector'])
.execute()
assert.equal(results.length, 2)
// vector and _distance are always returned
@@ -1057,3 +1059,63 @@ describe('Compact and cleanup', function () {
assert.equal(await table.countRows(), 3)
})
})
describe('schema evolution', function () {
// Create a new sample table
it('can add a new column to the schema', async function () {
const dir = await track().mkdir('lancejs')
const con = await lancedb.connect(dir)
const table = await con.createTable('vectors', [
{ id: 1n, vector: [0.1, 0.2] }
])
await table.addColumns([{ name: 'price', valueSql: 'cast(10.0 as float)' }])
const expectedSchema = new Schema([
new Field('id', new Int64()),
new Field('vector', new FixedSizeList(2, new Field('item', new Float32(), true))),
new Field('price', new Float32())
])
expect(await table.schema).to.deep.equal(expectedSchema)
})
it('can alter the columns in the schema', async function () {
const dir = await track().mkdir('lancejs')
const con = await lancedb.connect(dir)
const schema = new Schema([
new Field('id', new Int64(), false),
new Field('vector', new FixedSizeList(2, new Field('item', new Float32(), true))),
new Field('price', new Float64(), false)
])
const table = await con.createTable('vectors', [
{ id: 1n, vector: [0.1, 0.2], price: 10.0 }
])
expect(await table.schema).to.deep.equal(schema)
await table.alterColumns([
{ path: 'id', rename: 'new_id' },
{ path: 'price', nullable: true }
])
const expectedSchema = new Schema([
new Field('new_id', new Int64(), false),
new Field('vector', new FixedSizeList(2, new Field('item', new Float32(), true))),
new Field('price', new Float64(), true)
])
expect(await table.schema).to.deep.equal(expectedSchema)
})
it('can drop a column from the schema', async function () {
const dir = await track().mkdir('lancejs')
const con = await lancedb.connect(dir)
const table = await con.createTable('vectors', [
{ id: 1n, vector: [0.1, 0.2] }
])
await table.dropColumns(['vector'])
const expectedSchema = new Schema([
new Field('id', new Int64(), false)
])
expect(await table.schema).to.deep.equal(expectedSchema)
})
})


@@ -0,0 +1,34 @@
// Copyright 2024 Lance Developers.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
import * as os from "os";
import * as path from "path";
import * as fs from "fs";
import { connect } from "../dist/index.js";
describe("when working with a connection", () => {
const tmpDir = fs.mkdtempSync(path.join(os.tmpdir(), "test-connection"));
it("should fail if creating table twice, unless overwrite is true", async() => {
const db = await connect(tmpDir);
let tbl = await db.createTable("test", [{ id: 1 }, { id: 2 }]);
await expect(tbl.countRows()).resolves.toBe(2);
await expect(db.createTable("test", [{ id: 1 }, { id: 2 }])).rejects.toThrow();
tbl = await db.createTable("test", [{ id: 3 }], { mode: "overwrite" });
await expect(tbl.countRows()).resolves.toBe(1);
})
});


@@ -17,7 +17,7 @@ import * as path from "path";
import * as fs from "fs";
import { connect } from "../dist";
import { Schema, Field, Float32, Int32, FixedSizeList } from "apache-arrow";
import { Schema, Field, Float32, Int32, FixedSizeList, Int64, Float64 } from "apache-arrow";
import { makeArrowTable } from "../dist/arrow";
describe("Test creating index", () => {
@@ -201,17 +201,82 @@ describe("Read consistency interval", () => {
await table.add([{ id: 2 }]);
if (interval === undefined) {
expect(await table2.countRows()).toEqual(1n);
expect(await table2.countRows()).toEqual(1);
// TODO: once we implement time travel we can uncomment this part of the test.
// await table2.checkout_latest();
// expect(await table2.countRows()).toEqual(2);
} else if (interval === 0) {
expect(await table2.countRows()).toEqual(2n);
expect(await table2.countRows()).toEqual(2);
} else {
// interval == 0.1
expect(await table2.countRows()).toEqual(1n);
expect(await table2.countRows()).toEqual(1);
await new Promise(r => setTimeout(r, 100));
expect(await table2.countRows()).toEqual(2n);
expect(await table2.countRows()).toEqual(2);
}
});
});
describe('schema evolution', function () {
let tmpDir: string;
beforeEach(() => {
tmpDir = fs.mkdtempSync(path.join(os.tmpdir(), "schema-evolution-"));
});
// Create a new sample table
it('can add a new column to the schema', async function () {
const con = await connect(tmpDir)
const table = await con.createTable('vectors', [
{ id: 1n, vector: [0.1, 0.2] }
])
await table.addColumns([{ name: 'price', valueSql: 'cast(10.0 as float)' }])
const expectedSchema = new Schema([
new Field('id', new Int64(), true),
new Field('vector', new FixedSizeList(2, new Field('item', new Float32(), true)), true),
new Field('price', new Float32(), false)
])
expect(await table.schema()).toEqual(expectedSchema)
});
it('can alter the columns in the schema', async function () {
const con = await connect(tmpDir)
const schema = new Schema([
new Field('id', new Int64(), true),
new Field('vector', new FixedSizeList(2, new Field('item', new Float32(), true)), true),
new Field('price', new Float64(), false)
])
const table = await con.createTable('vectors', [
{ id: 1n, vector: [0.1, 0.2] }
])
// Can create a non-nullable column only through addColumns at the moment.
await table.addColumns([{ name: 'price', valueSql: 'cast(10.0 as double)' }])
expect(await table.schema()).toEqual(schema)
await table.alterColumns([
{ path: 'id', rename: 'new_id' },
{ path: 'price', nullable: true }
])
const expectedSchema = new Schema([
new Field('new_id', new Int64(), true),
new Field('vector', new FixedSizeList(2, new Field('item', new Float32(), true)), true),
new Field('price', new Float64(), true)
])
expect(await table.schema()).toEqual(expectedSchema)
});
it('can drop a column from the schema', async function () {
const con = await connect(tmpDir)
const table = await con.createTable('vectors', [
{ id: 1n, vector: [0.1, 0.2] }
])
await table.dropColumns(['vector'])
const expectedSchema = new Schema([
new Field('id', new Int64(), true)
])
expect(await table.schema()).toEqual(expectedSchema)
});
});


@@ -0,0 +1,15 @@
{
"extends": "../tsconfig.json",
"compilerOptions": {
"outDir": "./dist/spec",
"module": "commonjs",
"target": "es2022",
"types": [
"jest",
"node"
]
},
"include": [
"**/*",
]
}


@@ -17,6 +17,24 @@ import { Connection as _NativeConnection } from "./native";
import { Table } from "./table";
import { Table as ArrowTable } from "apache-arrow";
export interface CreateTableOptions {
/**
* The mode to use when creating the table.
*
* If this is set to "create" and the table already exists then either
* an error will be thrown or, if existOk is true, then nothing will
* happen. Any provided data will be ignored.
*
* If this is set to "overwrite" then any existing table will be replaced.
*/
mode: "create" | "overwrite";
/**
* If this is true and the table already exists and the mode is "create"
* then no error will be raised.
*/
existOk: boolean;
}
/**
* A LanceDB Connection that allows you to open tables and create new ones.
*
@@ -53,10 +71,18 @@ export class Connection {
*/
async createTable(
name: string,
data: Record<string, unknown>[] | ArrowTable
data: Record<string, unknown>[] | ArrowTable,
options?: Partial<CreateTableOptions>
): Promise<Table> {
let mode: string = options?.mode ?? "create";
const existOk = options?.existOk ?? false;
if (mode === "create" && existOk) {
mode = "exist_ok";
}
const buf = toBuffer(data);
const innerTable = await this.inner.createTable(name, buf);
const innerTable = await this.inner.createTable(name, buf, mode);
return new Table(innerTable);
}


@@ -12,6 +12,38 @@ export const enum MetricType {
Cosine = 1,
Dot = 2
}
/**
* A definition of a column alteration. The alteration changes the column at
* `path` to have the new name `name`, to be nullable if `nullable` is true,
* and to have the data type `data_type`. At least one of `rename` or `nullable`
* must be provided.
*/
export interface ColumnAlteration {
/**
* The path to the column to alter. This is a dot-separated path to the column.
* If it is a top-level column then it is just the name of the column. If it is
* a nested column then it is the path to the column, e.g. "a.b.c" for a column
* `c` nested inside a column `b` nested inside a column `a`.
*/
path: string
/**
* The new name of the column. If not provided then the name will not be changed.
* This must be distinct from the names of all other columns in the table.
*/
rename?: string
/** Set the new nullability. Note that a nullable column cannot be made non-nullable. */
nullable?: boolean
}
/** A definition of a new column to add to a table. */
export interface AddColumnsSql {
/** The name of the new column. */
name: string
/**
* The values to populate the new column with, as a SQL expression.
* The expression can reference other columns in the table.
*/
valueSql: string
}
export interface ConnectionOptions {
uri: string
apiKey?: string
@@ -53,7 +85,7 @@ export class Connection {
* - buf: The buffer containing the IPC file.
*
*/
createTable(name: string, buf: Buffer): Promise<Table>
createTable(name: string, buf: Buffer, mode: string): Promise<Table>
openTable(name: string): Promise<Table>
/** Drop table with the name. Or raise an error if the table does not exist. */
dropTable(name: string): Promise<void>
@@ -85,8 +117,11 @@ export class Table {
/** Return Schema as empty Arrow IPC file. */
schema(): Promise<Buffer>
add(buf: Buffer): Promise<void>
countRows(filter?: string | undefined | null): Promise<bigint>
countRows(filter?: string | undefined | null): Promise<number>
delete(predicate: string): Promise<void>
createIndex(): IndexBuilder
query(): Query
addColumns(transforms: Array<AddColumnsSql>): Promise<void>
alterColumns(alterations: Array<ColumnAlteration>): Promise<void>
dropColumns(columns: Array<string>): Promise<void>
}


@@ -13,7 +13,7 @@
// limitations under the License.
import { Schema, tableFromIPC } from "apache-arrow";
import { Table as _NativeTable } from "./native";
import { AddColumnsSql, ColumnAlteration, Table as _NativeTable } from "./native";
import { toBuffer, Data } from "./arrow";
import { Query } from "./query";
import { IndexBuilder } from "./indexer";
@@ -50,7 +50,7 @@ export class Table {
}
/** Count the total number of rows in the dataset. */
async countRows(filter?: string): Promise<bigint> {
async countRows(filter?: string): Promise<number> {
return await this.inner.countRows(filter);
}
@@ -150,4 +150,42 @@ export class Table {
}
return q;
}
// TODO: Support BatchUDF
/**
* Add new columns with defined values.
*
* @param newColumnTransforms pairs of column names and the SQL expression to use
* to calculate the value of the new column. These
* expressions will be evaluated for each row in the
* table, and can reference existing columns in the table.
*/
async addColumns(newColumnTransforms: AddColumnsSql[]): Promise<void> {
await this.inner.addColumns(newColumnTransforms);
}
/**
* Alter the name or nullability of columns.
*
* @param columnAlterations One or more alterations to apply to columns.
*/
async alterColumns(columnAlterations: ColumnAlteration[]): Promise<void> {
await this.inner.alterColumns(columnAlterations);
}
/**
* Drop one or more columns from the dataset
*
* This is a metadata-only operation and does not remove the data from the
* underlying storage. In order to remove the data, you must subsequently
* call ``compact_files`` to rewrite the data without the removed columns and
* then call ``cleanup_files`` to remove the old files.
*
* @param columnNames The names of the columns to drop. These can be nested
* column references (e.g. "a.b.c") or top-level column
* names (e.g. "a").
*/
async dropColumns(columnNames: string[]): Promise<void> {
await this.inner.dropColumns(columnNames);
}
}


@@ -51,8 +51,7 @@
"docs": "typedoc --plugin typedoc-plugin-markdown lancedb/index.ts",
"lint": "eslint lancedb --ext .js,.ts",
"prepublishOnly": "napi prepublish -t npm",
"//": "maxWorkers=1 is workaround for bigint issue in jest: https://github.com/jestjs/jest/issues/11617#issuecomment-1068732414",
"test": "npm run build && jest --maxWorkers=1",
"test": "npm run build && jest --verbose",
"universal": "napi universal",
"version": "napi version"
},
@@ -62,7 +61,7 @@
"lancedb-linux-arm64-gnu": "0.4.3",
"lancedb-linux-x64-gnu": "0.4.3"
},
"dependencies": {
"peerDependencies": {
"apache-arrow": "^15.0.0"
}
}


@@ -17,7 +17,7 @@ use napi_derive::*;
use crate::table::Table;
use crate::ConnectionOptions;
use lancedb::connection::{ConnectBuilder, Connection as LanceDBConnection};
use lancedb::connection::{ConnectBuilder, Connection as LanceDBConnection, CreateTableMode};
use lancedb::ipc::ipc_file_to_batches;
#[napi]
@@ -25,6 +25,17 @@ pub struct Connection {
conn: LanceDBConnection,
}
impl Connection {
fn parse_create_mode_str(mode: &str) -> napi::Result<CreateTableMode> {
match mode {
"create" => Ok(CreateTableMode::Create),
"overwrite" => Ok(CreateTableMode::Overwrite),
"exist_ok" => Ok(CreateTableMode::exist_ok(|builder| builder)),
_ => Err(napi::Error::from_reason(format!("Invalid mode {}", mode))),
}
}
}
#[napi]
impl Connection {
/// Create a new Connection instance from the given URI.
@@ -65,12 +76,19 @@ impl Connection {
/// - buf: The buffer containing the IPC file.
///
#[napi]
pub async fn create_table(&self, name: String, buf: Buffer) -> napi::Result<Table> {
pub async fn create_table(
&self,
name: String,
buf: Buffer,
mode: String,
) -> napi::Result<Table> {
let batches = ipc_file_to_batches(buf.to_vec())
.map_err(|e| napi::Error::from_reason(format!("Failed to read IPC file: {}", e)))?;
let mode = Self::parse_create_mode_str(&mode)?;
let tbl = self
.conn
.create_table(&name, Box::new(batches))
.mode(mode)
.execute()
.await
.map_err(|e| napi::Error::from_reason(format!("{}", e)))?;


@@ -13,8 +13,11 @@
// limitations under the License.
use arrow_ipc::writer::FileWriter;
use lancedb::table::AddDataOptions;
use lancedb::{ipc::ipc_file_to_batches, table::TableRef};
use lance::dataset::ColumnAlteration as LanceColumnAlteration;
use lancedb::{
ipc::ipc_file_to_batches,
table::{AddDataOptions, TableRef},
};
use napi::bindgen_prelude::*;
use napi_derive::napi;
@@ -65,13 +68,17 @@ impl Table {
}
#[napi]
pub async fn count_rows(&self, filter: Option<String>) -> napi::Result<usize> {
self.table.count_rows(filter).await.map_err(|e| {
napi::Error::from_reason(format!(
"Failed to count rows in table {}: {}",
self.table, e
))
})
pub async fn count_rows(&self, filter: Option<String>) -> napi::Result<i64> {
self.table
.count_rows(filter)
.await
.map(|val| val as i64)
.map_err(|e| {
napi::Error::from_reason(format!(
"Failed to count rows in table {}: {}",
self.table, e
))
})
}
#[napi]
@@ -93,4 +100,106 @@ impl Table {
pub fn query(&self) -> Query {
Query::new(self)
}
#[napi]
pub async fn add_columns(&self, transforms: Vec<AddColumnsSql>) -> napi::Result<()> {
let transforms = transforms
.into_iter()
.map(|sql| (sql.name, sql.value_sql))
.collect::<Vec<_>>();
let transforms = lance::dataset::NewColumnTransform::SqlExpressions(transforms);
self.table
.add_columns(transforms, None)
.await
.map_err(|err| {
napi::Error::from_reason(format!(
"Failed to add columns to table {}: {}",
self.table, err
))
})?;
Ok(())
}
#[napi]
pub async fn alter_columns(&self, alterations: Vec<ColumnAlteration>) -> napi::Result<()> {
for alteration in &alterations {
if alteration.rename.is_none() && alteration.nullable.is_none() {
return Err(napi::Error::from_reason(
"Alteration must have a 'rename' or 'nullable' field.",
));
}
}
let alterations = alterations
.into_iter()
.map(LanceColumnAlteration::from)
.collect::<Vec<_>>();
self.table
.alter_columns(&alterations)
.await
.map_err(|err| {
napi::Error::from_reason(format!(
"Failed to alter columns in table {}: {}",
self.table, err
))
})?;
Ok(())
}
#[napi]
pub async fn drop_columns(&self, columns: Vec<String>) -> napi::Result<()> {
let col_refs = columns.iter().map(String::as_str).collect::<Vec<_>>();
self.table.drop_columns(&col_refs).await.map_err(|err| {
napi::Error::from_reason(format!(
"Failed to drop columns from table {}: {}",
self.table, err
))
})?;
Ok(())
}
}
/// A definition of a column alteration. The alteration changes the column at
/// `path` to have the new name `name`, to be nullable if `nullable` is true,
/// and to have the data type `data_type`. At least one of `rename` or `nullable`
/// must be provided.
#[napi(object)]
pub struct ColumnAlteration {
/// The path to the column to alter. This is a dot-separated path to the column.
/// If it is a top-level column then it is just the name of the column. If it is
/// a nested column then it is the path to the column, e.g. "a.b.c" for a column
/// `c` nested inside a column `b` nested inside a column `a`.
pub path: String,
/// The new name of the column. If not provided then the name will not be changed.
/// This must be distinct from the names of all other columns in the table.
pub rename: Option<String>,
/// Set the new nullability. Note that a nullable column cannot be made non-nullable.
pub nullable: Option<bool>,
}
impl From<ColumnAlteration> for LanceColumnAlteration {
fn from(js: ColumnAlteration) -> Self {
let ColumnAlteration {
path,
rename,
nullable,
} = js;
Self {
path,
rename,
nullable,
// TODO: wire up this field
data_type: None,
}
}
}
/// A definition of a new column to add to a table.
#[napi(object)]
pub struct AddColumnsSql {
/// The name of the new column.
pub name: String,
/// The values to populate the new column with, as a SQL expression.
/// The expression can reference other columns in the table.
pub value_sql: String,
}


@@ -1,5 +1,5 @@
[bumpversion]
current_version = 0.5.7
current_version = 0.6.1
commit = True
message = [python] Bump version: {current_version} → {new_version}
tag = True

python/ASYNC_MIGRATION.md

@@ -0,0 +1,24 @@
# Migration from Sync to Async API
A new asynchronous API has been added to LanceDB. This API is built
on top of the Rust lancedb crate (instead of being built on top of
pylance). This will help keep the various language bindings in sync.
There are some slight changes between the synchronous and the asynchronous
APIs. This document will help you migrate. These changes relate mostly
to the Connection and Table classes.
## Almost all functions are async
The most important change is that almost all functions are now async.
This means the functions now return `asyncio` coroutines. You will
need to use `await` to call these functions.
## Connection
No changes yet.
## Table
* Previously `Table.schema` was a property. Now it is an async method.
* The method `Table.__len__` was removed and `len(table)` will no longer
work. Use `Table.count_rows` instead.
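For instance, a minimal sketch of the new entry point; at this point in the port only `connect_async` and `Connection.table_names` are known to work:

```python
import asyncio

import lancedb


async def main():
    # The async entry point mirrors lancedb.connect, but must be awaited.
    db = await lancedb.connect_async("~/.lancedb")
    # Connection methods are now coroutines as well.
    print(await db.table_names())


asyncio.run(main())
```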

python/Cargo.toml

@@ -0,0 +1,30 @@
[package]
name = "lancedb-python"
version = "0.4.10"
edition.workspace = true
description = "Python bindings for LanceDB"
license.workspace = true
repository.workspace = true
keywords.workspace = true
categories.workspace = true
[lib]
name = "_lancedb"
crate-type = ["cdylib"]
[dependencies]
arrow = { version = "50.0.0", features = ["pyarrow"] }
lancedb = { path = "../rust/lancedb" }
env_logger = "0.10"
pyo3 = { version = "0.20", features = ["extension-module", "abi3-py38"] }
pyo3-asyncio = { version = "0.20", features = ["attributes", "tokio-runtime"] }
# Prevent dynamic linking of lzma, which comes from datafusion
lzma-sys = { version = "*", features = ["static"] }
[build-dependencies]
pyo3-build-config = { version = "0.20.3", features = [
"extension-module",
"abi3-py38",
] }


@@ -20,10 +20,10 @@ results = table.search([0.1, 0.3]).limit(20).to_list()
print(results)
```
## Development
Create a virtual environment and activate it:
LanceDB is based on the Rust crate `lancedb` and is built with maturin. In order to build with maturin
you will either need a conda environment or a virtual environment (venv).
```bash
python -m venv venv
@@ -33,7 +33,15 @@ python -m venv venv
Install the necessary packages:
```bash
python -m pip install .
python -m pip install .[tests,dev]
```
To build the python package you can use maturin:
```bash
# This will build the rust bindings and place them in the appropriate place
# in your venv or conda environment
maturin develop
```
To run the unit tests:
@@ -45,7 +53,7 @@ pytest
To run the doc tests:
```bash
pytest --doctest-modules lancedb
pytest --doctest-modules python/lancedb
```
To run linter and automatically fix all errors:
@@ -61,31 +69,27 @@ If any packages are missing, install them with:
pip install <PACKAGE_NAME>
```
___
For **Windows** users, there may be errors when installing packages, so these commands may be helpful:
Activate the virtual environment:
```bash
. .\venv\Scripts\activate
```
You may need to run the installs separately:
```bash
pip install -e .[tests]
pip install -e .[dev]
```
`tantivy` requires `rust` to be installed, so install it with `conda`, as it doesn't support Windows installation:
```bash
pip install wheel
pip install cargo
conda install rust
pip install tantivy
```
To run the unit tests:
```bash
pytest
```

python/build.rs

@@ -0,0 +1,3 @@
fn main() {
pyo3_build_config::add_extension_module_link_args();
}


@@ -1,36 +0,0 @@
# Copyright 2023 LanceDB Developers
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from pathlib import Path
from typing import Iterable, List, Union
import numpy as np
import pyarrow as pa
from .util import safe_import_pandas
pd = safe_import_pandas()
DATA = Union[List[dict], dict, "pd.DataFrame", pa.Table, Iterable[pa.RecordBatch]]
VEC = Union[list, np.ndarray, pa.Array, pa.ChunkedArray]
URI = Union[str, Path]
VECTOR_COLUMN_NAME = "vector"
class Credential(str):
"""Credential field"""
def __repr__(self) -> str:
return "********"
def __str__(self) -> str:
return "********"


@@ -1,9 +1,9 @@
[project]
name = "lancedb"
version = "0.5.7"
version = "0.6.1"
dependencies = [
"deprecation",
"pylance==0.9.18",
"pylance==0.10.1",
"ratelimiter~=1.0",
"retry>=0.9.2",
"tqdm>=4.27.0",
@@ -14,7 +14,7 @@ dependencies = [
"pyyaml>=6.0",
"click>=8.1.7",
"requests>=2.31.0",
"overrides>=0.7"
"overrides>=0.7",
]
description = "lancedb"
authors = [{ name = "LanceDB Devs", email = "dev@lancedb.com" }]
@@ -26,7 +26,7 @@ keywords = [
"data-science",
"machine-learning",
"arrow",
"data-analytics"
"data-analytics",
]
classifiers = [
"Development Status :: 3 - Alpha",
@@ -48,21 +48,53 @@ classifiers = [
repository = "https://github.com/lancedb/lancedb"
[project.optional-dependencies]
tests = ["aiohttp", "pandas>=1.4", "pytest", "pytest-mock", "pytest-asyncio", "duckdb", "pytz", "polars>=0.19"]
tests = [
"aiohttp",
"pandas>=1.4",
"pytest",
"pytest-mock",
"pytest-asyncio",
"duckdb",
"pytz",
"polars>=0.19",
]
dev = ["ruff", "pre-commit"]
docs = ["mkdocs", "mkdocs-jupyter", "mkdocs-material", "mkdocstrings[python]", "mkdocs-ultralytics-plugin==0.0.44"]
docs = [
"mkdocs",
"mkdocs-jupyter",
"mkdocs-material",
"mkdocstrings[python]",
"mkdocs-ultralytics-plugin==0.0.44",
]
clip = ["torch", "pillow", "open-clip"]
embeddings = ["openai>=1.6.1", "sentence-transformers", "torch", "pillow", "open-clip-torch", "cohere", "huggingface_hub",
"InstructorEmbedding", "google.generativeai", "boto3>=1.28.57", "awscli>=1.29.57", "botocore>=1.31.57"]
embeddings = [
"openai>=1.6.1",
"sentence-transformers",
"torch",
"pillow",
"open-clip-torch",
"cohere",
"huggingface_hub",
"InstructorEmbedding",
"google.generativeai",
"boto3>=1.28.57",
"awscli>=1.29.57",
"botocore>=1.31.57",
]
[tool.maturin]
python-source = "python"
module-name = "lancedb._lancedb"
[project.scripts]
lancedb = "lancedb.cli.cli:cli"
[build-system]
requires = ["setuptools", "wheel"]
build-backend = "setuptools.build_meta"
requires = ["maturin>=1.4"]
build-backend = "maturin"
[tool.ruff]
[tool.ruff.lint]
select = ["F", "E", "W", "I", "G", "TCH", "PERF"]
[tool.pytest.ini_options]
@@ -70,5 +102,5 @@ addopts = "--strict-markers --ignore-glob=lancedb/embeddings/*.py"
markers = [
"slow: marks tests as slow (deselect with '-m \"not slow\"')",
"asyncio"
"asyncio",
]


@@ -19,8 +19,9 @@ from typing import Optional, Union
__version__ = importlib.metadata.version("lancedb")
from .common import URI
from .db import DBConnection, LanceDBConnection
from ._lancedb import connect as lancedb_connect
from .common import URI, sanitize_uri
from .db import AsyncConnection, AsyncLanceDBConnection, DBConnection, LanceDBConnection
from .remote.db import RemoteDBConnection
from .schema import vector # noqa: F401
from .utils import sentry_log # noqa: F401
@@ -101,3 +102,74 @@ def connect(
uri, api_key, region, host_override, request_thread_pool=request_thread_pool
)
return LanceDBConnection(uri, read_consistency_interval=read_consistency_interval)
async def connect_async(
uri: URI,
*,
api_key: Optional[str] = None,
region: str = "us-east-1",
host_override: Optional[str] = None,
read_consistency_interval: Optional[timedelta] = None,
request_thread_pool: Optional[Union[int, ThreadPoolExecutor]] = None,
) -> AsyncConnection:
"""Connect to a LanceDB database.
Parameters
----------
uri: str or Path
The uri of the database.
api_key: str, optional
If present, connect to LanceDB cloud.
Otherwise, connect to a database on file system or cloud storage.
Can be set via environment variable `LANCEDB_API_KEY`.
region: str, default "us-east-1"
The region to use for LanceDB Cloud.
host_override: str, optional
The override url for LanceDB Cloud.
read_consistency_interval: timedelta, default None
(For LanceDB OSS only)
The interval at which to check for updates to the table from other
processes. If None, then consistency is not checked. For performance
reasons, this is the default. For strong consistency, set this to
zero seconds. Then every read will check for updates from other
processes. As a compromise, you can set this to a non-zero timedelta
for eventual consistency. If more than that interval has passed since
the last check, then the table will be checked for updates. Note: this
consistency only applies to read operations. Write operations are
always consistent.
request_thread_pool: int or ThreadPoolExecutor, optional
The thread pool to use for making batch requests to the LanceDB Cloud API.
If an integer, then a ThreadPoolExecutor will be created with that
number of threads. If None, then a ThreadPoolExecutor will be created
with the default number of threads. If a ThreadPoolExecutor, then that
executor will be used for making requests. This is for LanceDB Cloud
only and is only used when making batch requests (i.e., passing in
multiple queries to the search method at once).
Examples
--------
For a local directory, provide a path for the database:
>>> import lancedb
>>> db = await lancedb.connect_async("~/.lancedb")
For object storage, use a URI prefix:
>>> db = await lancedb.connect_async("s3://my-bucket/lancedb")
Connect to LanceDB Cloud:
>>> db = await lancedb.connect_async("db://my_database", api_key="ldb_...")
Returns
-------
conn : AsyncConnection
A connection to a LanceDB database.
"""
return AsyncLanceDBConnection(
await lancedb_connect(
sanitize_uri(uri), api_key, region, host_override, read_consistency_interval
)
)
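# A minimal sketch (hedged) of the new async entry point, assuming a local
# database path; `table_names` is the only listing call wired up so far.
import asyncio

async def _demo():
    db = await connect_async("~/.lancedb")
    print(await db.table_names())

asyncio.run(_demo())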

View File

@@ -0,0 +1,24 @@
from typing import Optional
import pyarrow as pa
class Connection(object):
async def table_names(self) -> list[str]: ...
async def create_table(
self, name: str, mode: str, data: pa.RecordBatchReader
) -> Table: ...
async def create_empty_table(
self, name: str, mode: str, schema: pa.Schema
) -> Table: ...
class Table(object):
def name(self) -> str: ...
async def schema(self) -> pa.Schema: ...
async def connect(
uri: str,
api_key: Optional[str],
region: Optional[str],
host_override: Optional[str],
read_consistency_interval: Optional[float],
) -> Connection: ...

View File

@@ -0,0 +1,136 @@
# Copyright 2023 LanceDB Developers
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from pathlib import Path
from typing import Iterable, List, Optional, Union
import numpy as np
import pyarrow as pa
from .util import safe_import_pandas
pd = safe_import_pandas()
DATA = Union[List[dict], dict, "pd.DataFrame", pa.Table, Iterable[pa.RecordBatch]]
VEC = Union[list, np.ndarray, pa.Array, pa.ChunkedArray]
URI = Union[str, Path]
VECTOR_COLUMN_NAME = "vector"
class Credential(str):
"""Credential field"""
def __repr__(self) -> str:
return "********"
def __str__(self) -> str:
return "********"
def sanitize_uri(uri: URI) -> str:
return str(uri)
def _casting_recordbatch_iter(
input_iter: Iterable[pa.RecordBatch], schema: pa.Schema
) -> Iterable[pa.RecordBatch]:
"""
Wrapper around an iterator of record batches. If the batches don't match the
schema, try to cast them to the schema. If that fails, raise an error.
This is helpful for users who might have written the iterator with default
data types in PyArrow, but specified more specific types in the schema. For
example, PyArrow defaults to float64 for floating point types, but Lance
uses float32 for vectors.
"""
for batch in input_iter:
if not isinstance(batch, pa.RecordBatch):
raise TypeError(f"Expected RecordBatch, got {type(batch)}")
if batch.schema != schema:
try:
# RecordBatch doesn't have a cast method, but table does.
batch = pa.Table.from_batches([batch]).cast(schema).to_batches()[0]
except pa.lib.ArrowInvalid:
raise ValueError(
f"Input RecordBatch iterator yielded a batch with schema that "
f"does not match the expected schema.\nExpected:\n{schema}\n"
f"Got:\n{batch.schema}"
)
yield batch
def data_to_reader(
data: DATA, schema: Optional[pa.Schema] = None
) -> pa.RecordBatchReader:
"""Convert various types of input into a RecordBatchReader"""
if pd is not None and isinstance(data, pd.DataFrame):
return pa.Table.from_pandas(data, schema=schema).to_reader()
elif isinstance(data, pa.Table):
return data.to_reader()
elif isinstance(data, pa.RecordBatch):
return pa.Table.from_batches([data]).to_reader()
# elif isinstance(data, LanceDataset):
# return data_obj.scanner().to_reader()
elif isinstance(data, pa.dataset.Dataset):
return pa.dataset.Scanner.from_dataset(data).to_reader()
elif isinstance(data, pa.dataset.Scanner):
return data.to_reader()
elif isinstance(data, pa.RecordBatchReader):
return data
elif (
type(data).__module__.startswith("polars")
and data.__class__.__name__ == "DataFrame"
):
return data.to_arrow().to_reader()
# for other iterables, assume they are of type Iterable[RecordBatch]
elif isinstance(data, Iterable):
if schema is not None:
data = _casting_recordbatch_iter(data, schema)
return pa.RecordBatchReader.from_batches(schema, data)
else:
raise ValueError(
"Must provide schema to write dataset from RecordBatch iterable"
)
else:
raise TypeError(
f"Unknown data type {type(data)}. "
"Please check "
"https://lancedb.github.io/lance/read_and_write.html "
"to see supported types."
)
def validate_schema(schema: pa.Schema):
"""
Make sure the metadata is valid utf8
"""
if schema.metadata is not None:
_validate_metadata(schema.metadata)
def _validate_metadata(metadata: dict):
"""
Make sure the metadata values are valid utf8 (can be nested)
Raises ValueError if not valid utf8
"""
for k, v in metadata.items():
if isinstance(v, bytes):
try:
v.decode("utf8")
except UnicodeDecodeError:
raise ValueError(
f"Metadata key {k} is not valid utf8. "
"Consider base64 encode for generic binary metadata."
)
elif isinstance(v, dict):
_validate_metadata(v)
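# A hedged sketch of the casting path above: PyArrow infers float64 for the
# batch below, and data_to_reader casts it to the float32 schema.
schema32 = pa.schema([pa.field("x", pa.float32())])

def _float64_batches():
    yield pa.RecordBatch.from_pydict({"x": [1.0, 2.0]})

reader = data_to_reader(_float64_batches(), schema=schema32)
assert reader.read_all().schema == schema32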

View File

@@ -13,6 +13,7 @@
from __future__ import annotations
import inspect
import os
from abc import abstractmethod
from pathlib import Path
@@ -22,15 +23,20 @@ import pyarrow as pa
from overrides import EnforceOverrides, override
from pyarrow import fs
from .table import LanceTable, Table
from lancedb.common import data_to_reader, validate_schema
from lancedb.embeddings.registry import EmbeddingFunctionRegistry
from lancedb.utils.events import register_event
from .pydantic import LanceModel
from .table import AsyncLanceTable, LanceTable, Table, _sanitize_data
from .util import fs_from_uri, get_uri_location, get_uri_scheme, join_uri
if TYPE_CHECKING:
from datetime import timedelta
from ._lancedb import Connection as LanceDbConnection
from .common import DATA, URI
from .embeddings import EmbeddingFunctionConfig
from .pydantic import LanceModel
class DBConnection(EnforceOverrides):
@@ -40,14 +46,21 @@ class DBConnection(EnforceOverrides):
def table_names(
self, page_token: Optional[str] = None, limit: int = 10
) -> Iterable[str]:
"""List all table in this database
"""List all tables in this database, in sorted order
Parameters
----------
page_token: str, optional
The token to use for pagination. If not present, start from the beginning.
Typically, this token is the last table name from the previous page.
Only supported by LanceDB Cloud.
limit: int, default 10
The size of the page to return.
Only supported by LanceDB Cloud.
Returns
-------
Iterable of str
"""
pass
@@ -412,3 +425,313 @@ class LanceDBConnection(DBConnection):
def drop_database(self):
filesystem, path = fs_from_uri(self.uri)
filesystem.delete_dir(path)
class AsyncConnection(EnforceOverrides):
"""An active LanceDB connection interface."""
@abstractmethod
async def table_names(
self, *, page_token: Optional[str] = None, limit: int = 10
) -> Iterable[str]:
"""List all tables in this database, in sorted order
Parameters
----------
page_token: str, optional
The token to use for pagination. If not present, start from the beginning.
Typically, this token is the last table name from the previous page.
Only supported by LanceDB Cloud.
limit: int, default 10
The size of the page to return.
Only supported by LanceDB Cloud.
Returns
-------
Iterable of str
"""
pass
@abstractmethod
async def create_table(
self,
name: str,
data: Optional[DATA] = None,
schema: Optional[Union[pa.Schema, LanceModel]] = None,
mode: str = "create",
exist_ok: bool = False,
on_bad_vectors: str = "error",
fill_value: float = 0.0,
embedding_functions: Optional[List[EmbeddingFunctionConfig]] = None,
) -> Table:
"""Create a [Table][lancedb.table.Table] in the database.
Parameters
----------
name: str
The name of the table.
data: The data to initialize the table, *optional*
User must provide at least one of `data` or `schema`.
Acceptable types are:
- dict or list-of-dict
- pandas.DataFrame
- pyarrow.Table or pyarrow.RecordBatch
schema: The schema of the table, *optional*
Acceptable types are:
- pyarrow.Schema
- [LanceModel][lancedb.pydantic.LanceModel]
mode: str; default "create"
The mode to use when creating the table.
Can be either "create" or "overwrite".
By default, if the table already exists, an exception is raised.
If you want to overwrite the table, use mode="overwrite".
exist_ok: bool, default False
If a table by the same name already exists, then raise an exception
if exist_ok=False. If exist_ok=True, then open the existing table;
it will not add the provided data but will validate against any
schema that's specified.
on_bad_vectors: str, default "error"
What to do if any of the vectors are not the same size or contain NaNs.
One of "error", "drop", "fill".
fill_value: float
The value to use when filling vectors. Only used if on_bad_vectors="fill".
Returns
-------
LanceTable
A reference to the newly created table.
!!! note
The vector index won't be created by default.
To create the index, call the `create_index` method on the table.
Examples
--------
Can create with list of tuples or dictionaries:
>>> import lancedb
>>> db = lancedb.connect("./.lancedb")
>>> data = [{"vector": [1.1, 1.2], "lat": 45.5, "long": -122.7},
... {"vector": [0.2, 1.8], "lat": 40.1, "long": -74.1}]
>>> db.create_table("my_table", data)
LanceTable(connection=..., name="my_table")
>>> db["my_table"].head()
pyarrow.Table
vector: fixed_size_list<item: float>[2]
child 0, item: float
lat: double
long: double
----
vector: [[[1.1,1.2],[0.2,1.8]]]
lat: [[45.5,40.1]]
long: [[-122.7,-74.1]]
You can also pass a pandas DataFrame:
>>> import pandas as pd
>>> data = pd.DataFrame({
... "vector": [[1.1, 1.2], [0.2, 1.8]],
... "lat": [45.5, 40.1],
... "long": [-122.7, -74.1]
... })
>>> db.create_table("table2", data)
LanceTable(connection=..., name="table2")
>>> db["table2"].head()
pyarrow.Table
vector: fixed_size_list<item: float>[2]
child 0, item: float
lat: double
long: double
----
vector: [[[1.1,1.2],[0.2,1.8]]]
lat: [[45.5,40.1]]
long: [[-122.7,-74.1]]
Data is converted to Arrow before being written to disk. For maximum
control over how data is saved, either provide the PyArrow schema to
convert to or else provide a [PyArrow Table](pyarrow.Table) directly.
>>> custom_schema = pa.schema([
... pa.field("vector", pa.list_(pa.float32(), 2)),
... pa.field("lat", pa.float32()),
... pa.field("long", pa.float32())
... ])
>>> db.create_table("table3", data, schema = custom_schema)
LanceTable(connection=..., name="table3")
>>> db["table3"].head()
pyarrow.Table
vector: fixed_size_list<item: float>[2]
child 0, item: float
lat: float
long: float
----
vector: [[[1.1,1.2],[0.2,1.8]]]
lat: [[45.5,40.1]]
long: [[-122.7,-74.1]]
It is also possible to create a table from `[Iterable[pa.RecordBatch]]`:
>>> import pyarrow as pa
>>> def make_batches():
... for i in range(5):
... yield pa.RecordBatch.from_arrays(
... [
... pa.array([[3.1, 4.1], [5.9, 26.5]],
... pa.list_(pa.float32(), 2)),
... pa.array(["foo", "bar"]),
... pa.array([10.0, 20.0]),
... ],
... ["vector", "item", "price"],
... )
>>> schema=pa.schema([
... pa.field("vector", pa.list_(pa.float32(), 2)),
... pa.field("item", pa.utf8()),
... pa.field("price", pa.float32()),
... ])
>>> db.create_table("table4", make_batches(), schema=schema)
LanceTable(connection=..., name="table4")
"""
raise NotImplementedError
async def open_table(self, name: str) -> Table:
"""Open a Lance Table in the database.
Parameters
----------
name: str
The name of the table.
Returns
-------
A LanceTable object representing the table.
"""
raise NotImplementedError
async def drop_table(self, name: str):
"""Drop a table from the database.
Parameters
----------
name: str
The name of the table.
"""
raise NotImplementedError
async def drop_database(self):
"""
Drop the database.
This is the same as dropping all of the tables.
"""
raise NotImplementedError
class AsyncLanceDBConnection(AsyncConnection):
def __init__(self, connection: LanceDbConnection):
self._inner = connection
async def __repr__(self) -> str:
pass
@override
async def table_names(
self,
*,
page_token=None,
limit=None,
) -> Iterable[str]:
# TODO: hook in page_token and limit
return await self._inner.table_names()
@override
async def create_table(
self,
name: str,
data: Optional[DATA] = None,
schema: Optional[Union[pa.Schema, LanceModel]] = None,
mode: str = "create",
exist_ok: bool = False,
on_bad_vectors: str = "error",
fill_value: float = 0.0,
embedding_functions: Optional[List[EmbeddingFunctionConfig]] = None,
) -> Table:
if mode.lower() not in ["create", "overwrite"]:
raise ValueError("mode must be either 'create' or 'overwrite'")
if inspect.isclass(schema) and issubclass(schema, LanceModel):
# convert LanceModel to pyarrow schema
# note that it's possible this contains
# embedding function metadata already
schema = schema.to_arrow_schema()
metadata = None
if embedding_functions is not None:
# If we passed in embedding functions explicitly
# then we'll override any schema metadata that
# may have been implicitly specified by the LanceModel schema
registry = EmbeddingFunctionRegistry.get_instance()
metadata = registry.get_table_metadata(embedding_functions)
if data is not None:
data = _sanitize_data(
data,
schema,
metadata=metadata,
on_bad_vectors=on_bad_vectors,
fill_value=fill_value,
)
if schema is None:
if data is None:
raise ValueError("Either data or schema must be provided")
elif hasattr(data, "schema"):
schema = data.schema
elif isinstance(data, Iterable):
if metadata:
raise TypeError(
(
"Persistent embedding functions not yet "
"supported for generator data input"
)
)
if metadata:
schema = schema.with_metadata(metadata)
validate_schema(schema)
if mode == "create" and exist_ok:
mode = "exist_ok"
if data is None:
new_table = await self._inner.create_empty_table(name, mode, schema)
else:
data = data_to_reader(data, schema)
new_table = await self._inner.create_table(
name,
mode,
data,
)
register_event("create_table")
return AsyncLanceTable(new_table)
@override
async def open_table(self, name: str) -> LanceTable:
raise NotImplementedError
@override
async def drop_table(self, name: str, ignore_missing: bool = False):
raise NotImplementedError
@override
async def drop_database(self):
raise NotImplementedError
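# A hedged end-to-end sketch of the async connection implemented above; the
# table name and data are illustrative only.
import asyncio
import lancedb

async def _demo():
    db = await lancedb.connect_async("./.lancedb")
    tbl = await db.create_table("async_demo", data=[{"vector": [1.1, 1.2], "b": 2}])
    print(tbl.name, await tbl.schema())

asyncio.run(_demo())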

View File

@@ -103,9 +103,9 @@ class InstructorEmbeddingFunction(TextEmbeddingFunction):
# convert_to_numpy: bool = True # Hardcoding this as numpy can be ingested directly
source_instruction: str = "represent the document for retrieval"
query_instruction: (
str
) = "represent the document for retrieving the most similar documents"
query_instruction: str = (
"represent the document for retrieving the most similar documents"
)
@weak_lru(maxsize=1)
def ndims(self):

View File

@@ -12,6 +12,7 @@
# limitations under the License.
"""Full text search index using tantivy-py"""
import os
from typing import List, Tuple

View File

@@ -16,7 +16,7 @@ from __future__ import annotations
from abc import ABC, abstractmethod
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import TYPE_CHECKING, List, Literal, Optional, Tuple, Type, Union
from typing import TYPE_CHECKING, Dict, List, Literal, Optional, Tuple, Type, Union
import deprecation
import numpy as np
@@ -93,7 +93,7 @@ class Query(pydantic.BaseModel):
metric: str = "L2"
# which columns to return in the results
columns: Optional[List[str]] = None
columns: Optional[Union[List[str], Dict[str, str]]] = None
# optional query parameters for tuning the results,
# e.g. `{"nprobes": "10", "refine_factor": "10"}`
@@ -117,23 +117,36 @@ class LanceQueryBuilder(ABC):
query: Optional[Union[np.ndarray, str, "PIL.Image.Image", Tuple]],
query_type: str,
vector_column_name: str,
vector: Optional[VEC] = None,
text: Optional[str] = None,
) -> LanceQueryBuilder:
if query is None:
if query is None and vector is None and text is None:
return LanceEmptyQueryBuilder(table)
if query_type == "hybrid":
# hybrid fts and vector query
return LanceHybridQueryBuilder(table, query, vector_column_name)
return LanceHybridQueryBuilder(
table, query, vector_column_name, vector, text
)
# convert "auto" query_type to "vector", "fts"
# or "hybrid" and convert the query to vector if needed
# Resolve hybrid query with explicit vector and text params here to avoid
# adding them as params in the BaseQueryBuilder class
if vector is not None or text is not None:
if query_type not in ["hybrid", "auto"]:
raise ValueError(
"If `vector` and `text` are provided, then `query_type`\
must be 'hybrid' or 'auto'"
)
return LanceHybridQueryBuilder(
table, query, vector_column_name, vector, text
)
# convert "auto" query_type to "vector" or "fts"
# and convert the query to vector if needed
query, query_type = cls._resolve_query(
table, query, query_type, vector_column_name
)
if query_type == "hybrid":
return LanceHybridQueryBuilder(table, query, vector_column_name)
if isinstance(query, str):
# fts
return LanceFtsQueryBuilder(table, query)
@@ -161,8 +174,6 @@ class LanceQueryBuilder(ABC):
elif query_type == "auto":
if isinstance(query, (list, np.ndarray)):
return query, "vector"
if isinstance(query, tuple):
return query, "hybrid"
else:
conf = table.embedding_functions.get(vector_column_name)
if conf is not None:
@@ -321,20 +332,25 @@ class LanceQueryBuilder(ABC):
self._limit = limit
return self
def select(self, columns: list) -> LanceQueryBuilder:
def select(self, columns: Union[list[str], dict[str, str]]) -> LanceQueryBuilder:
"""Set the columns to return.
Parameters
----------
columns: list
The columns to return.
columns: list of str, or dict of str to str, default None
List of column names to be fetched.
Or a dictionary of column names to SQL expressions.
All columns are fetched if None or unspecified.
Returns
-------
LanceQueryBuilder
The LanceQueryBuilder object.
"""
self._columns = columns
if isinstance(columns, list) or isinstance(columns, dict):
self._columns = columns
else:
raise ValueError("columns must be a list or a dictionary")
return self
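# A hedged sketch of the dynamic projection this enables; assumes `table`
# has an "id" column (mirrors test_dynamic_projection further below).
rows = (
    LanceVectorQueryBuilder(table, [0, 0], "vector")
    .select({"id": "id", "id2": "id * 2"})  # output name -> SQL expression
    .limit(1)
    .to_list()
)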
def where(self, where: str, prefilter: bool = False) -> LanceQueryBuilder:
@@ -392,7 +408,7 @@ class LanceVectorQueryBuilder(LanceQueryBuilder):
>>> (table.search([0.4, 0.4])
... .metric("cosine")
... .where("b < 10")
... .select(["b"])
... .select(["b", "vector"])
... .limit(2)
... .to_pandas())
b vector _distance
@@ -623,12 +639,20 @@ class LanceEmptyQueryBuilder(LanceQueryBuilder):
class LanceHybridQueryBuilder(LanceQueryBuilder):
def __init__(self, table: "Table", query: str, vector_column: str):
def __init__(
self,
table: "Table",
query: str,
vector_column: str,
vector: Optional[VEC] = None,
text: Optional[str] = None,
):
super().__init__(table)
self._validate_fts_index()
vector_query, fts_query = self._validate_query(query)
vector_query, fts_query = self._validate_query(
query, vector_column, vector, text
)
self._fts_query = LanceFtsQueryBuilder(table, fts_query)
vector_query = self._query_to_vector(table, vector_query, vector_column)
self._vector_query = LanceVectorQueryBuilder(table, vector_query, vector_column)
self._norm = "score"
self._reranker = LinearCombinationReranker(weight=0.7, fill=1.0)
@@ -639,23 +663,31 @@ class LanceHybridQueryBuilder(LanceQueryBuilder):
"Please create a full-text search index " "to perform hybrid search."
)
def _validate_query(self, query):
# Temp hack to support vectorized queries for hybrid search
if isinstance(query, str):
return query, query
elif isinstance(query, tuple):
if len(query) != 2:
def _validate_query(self, query, vector_column, vector, text):
if query is not None:
if vector is not None or text is not None:
raise ValueError(
"The query must be a tuple of (vector_query, fts_query)."
"Either pass `query` or `vector` and `text` separately, not both."
)
if not isinstance(query[0], (list, np.ndarray, pa.Array, pa.ChunkedArray)):
else:
if vector is None or text is None:
raise ValueError(
"Either pass `query` or `vector` and `text` separately, not both."
)
if vector is not None and text is not None:
if not isinstance(vector, (list, np.ndarray, pa.Array, pa.ChunkedArray)):
raise ValueError(f"The vector query must be one of {VEC}.")
if not isinstance(query[1], str):
if not isinstance(text, str):
raise ValueError("The fts query must be a string.")
return query[0], query[1]
return vector, text
if isinstance(query, str):
vector = self._query_to_vector(self._table, query, vector_column)
return vector, query
else:
raise ValueError(
"The query must be either a string or a tuple of (vector, string)."
f"For hybrid search `query` must be a string or `vector` and `text` \
must be provided explicitly of types {VEC} and str respectively."
)
def to_arrow(self) -> pa.Table:
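# A hedged sketch of the new hybrid entry points replacing the old tuple
# form; `table` and `query_vector` are assumed (see the reranker tests below).
result = (
    table.search(vector=query_vector, text="Our father who art in heaven",
                 query_type="hybrid")
    .limit(30)
    .rerank(normalize="score")
    .to_arrow()
)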

View File

@@ -15,7 +15,7 @@ import logging
import uuid
from concurrent.futures import Future
from functools import cached_property
from typing import Dict, Optional, Union
from typing import Dict, Iterable, Optional, Union
import pyarrow as pa
from lance import json_to_schema
@@ -66,12 +66,36 @@ class RemoteTable(Table):
"""to_pandas() is not yet supported on LanceDB cloud."""
return NotImplementedError("to_pandas() is not yet supported on LanceDB cloud.")
def create_scalar_index(self, *args, **kwargs):
"""Creates a scalar index"""
return NotImplementedError(
"create_scalar_index() is not yet supported on LanceDB cloud."
def list_indices(self):
"""List all the indices on the table"""
resp = self._conn._client.post(f"/v1/table/{self._name}/index/list/")
return resp
def create_scalar_index(
self,
column: str,
):
"""Creates a scalar index
Parameters
----------
column : str
The column to be indexed. Must be a boolean, integer, float,
or string column.
"""
index_type = "scalar"
data = {
"column": column,
"index_type": index_type,
"replace": True,
}
resp = self._conn._client.post(
f"/v1/table/{self._name}/create_scalar_index/", data=data
)
return resp
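# A hedged usage sketch of the remote call above; the cloud database and
# table names are hypothetical.
db = lancedb.connect("db://my_database", api_key="ldb_...")
tbl = db.open_table("images")
tbl.create_scalar_index("category")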
def create_index(
self,
metric="L2",
@@ -277,6 +301,7 @@ class RemoteTable(Table):
f = Future()
f.set_result(self._conn._client.query(name, q))
return f
else:
def submit(name, q):
@@ -473,6 +498,21 @@ class RemoteTable(Table):
"count_rows() is not yet supported on the LanceDB cloud"
)
def add_columns(self, transforms: Dict[str, str]):
raise NotImplementedError(
"add_columns() is not yet supported on the LanceDB cloud"
)
def alter_columns(self, alterations: Iterable[Dict[str, str]]):
raise NotImplementedError(
"alter_columns() is not yet supported on the LanceDB cloud"
)
def drop_columns(self, columns: Iterable[str]):
raise NotImplementedError(
"drop_columns() is not yet supported on the LanceDB cloud"
)
def add_index(tbl: pa.Table, i: int) -> pa.Table:
return tbl.add_column(

View File

@@ -12,6 +12,7 @@
# limitations under the License.
"""Schema related utilities."""
import pyarrow as pa

View File

@@ -28,6 +28,7 @@ import pyarrow.compute as pc
import pyarrow.fs as pa_fs
from lance import LanceDataset
from lance.vector import vec_to_table
from overrides import override
from .common import DATA, VEC, VECTOR_COLUMN_NAME
from .embeddings import EmbeddingFunctionConfig, EmbeddingFunctionRegistry
@@ -48,6 +49,7 @@ if TYPE_CHECKING:
import PIL
from lance.dataset import CleanupStats, ReaderLike
from ._lancedb import Table as LanceDBTable
from .db import LanceDBConnection
@@ -160,7 +162,7 @@ class Table(ABC):
Can query the table with [Table.search][lancedb.table.Table.search].
>>> table.search([0.4, 0.4]).select(["b"]).to_pandas()
>>> table.search([0.4, 0.4]).select(["b", "vector"]).to_pandas()
b vector _distance
0 4 [0.5, 1.3] 0.82
1 2 [1.1, 1.2] 1.13
@@ -416,6 +418,8 @@ class Table(ABC):
query: Optional[Union[VEC, str, "PIL.Image.Image", Tuple]] = None,
vector_column_name: Optional[str] = None,
query_type: str = "auto",
vector: Optional[VEC] = None,
text: Optional[str] = None,
) -> LanceQueryBuilder:
"""Create a search query to find the nearest neighbors
of the given query vector. We currently support [vector search][search]
@@ -436,7 +440,7 @@ class Table(ABC):
>>> query = [0.4, 1.4, 2.4]
>>> (table.search(query)
... .where("original_width > 1000", prefilter=True)
... .select(["caption", "original_width"])
... .select(["caption", "original_width", "vector"])
... .limit(2)
... .to_pandas())
caption original_width vector _distance
@@ -660,6 +664,56 @@ class Table(ABC):
For most cases, the default should be fine.
"""
@abstractmethod
def add_columns(self, transforms: Dict[str, str]):
"""
Add new columns with defined values.
This is not yet available in LanceDB Cloud.
Parameters
----------
transforms: Dict[str, str]
A map of column name to a SQL expression to use to calculate the
value of the new column. These expressions will be evaluated for
each row in the table, and can reference existing columns.
"""
@abstractmethod
def alter_columns(self, alterations: Iterable[Dict[str, str]]):
"""
Alter column names and nullability.
This is not yet available in LanceDB Cloud.
Parameters
----------
alterations : Iterable[Dict[str, Any]]
A sequence of dictionaries, each with the following keys:
- "path": str
The column path to alter. For a top-level column, this is the name.
For a nested column, this is the dot-separated path, e.g. "a.b.c".
- "name": str, optional
The new name of the column. If not specified, the column name is
not changed.
- "nullable": bool, optional
Whether the column should be nullable. If not specified, the column
nullability is not changed. Only non-nullable columns can be changed
to nullable. Currently, you cannot change a nullable column to
non-nullable.
"""
@abstractmethod
def drop_columns(self, columns: Iterable[str]):
"""
Drop columns from the table.
This is not yet available in LanceDB Cloud.
Parameters
----------
columns : Iterable[str]
The names of the columns to drop.
"""
class _LanceDatasetRef(ABC):
@property
@@ -1201,6 +1255,8 @@ class LanceTable(Table):
query: Optional[Union[VEC, str, "PIL.Image.Image", Tuple]] = None,
vector_column_name: Optional[str] = None,
query_type: str = "auto",
vector: Optional[VEC] = None,
text: Optional[str] = None,
) -> LanceQueryBuilder:
"""Create a search query to find the nearest neighbors
of the given query vector. We currently support [vector search][search]
@@ -1219,7 +1275,7 @@ class LanceTable(Table):
>>> query = [0.4, 1.4, 2.4]
>>> (table.search(query)
... .where("original_width > 1000", prefilter=True)
... .select(["caption", "original_width"])
... .select(["caption", "original_width", "vector"])
... .limit(2)
... .to_pandas())
caption original_width vector _distance
@@ -1255,6 +1311,10 @@ class LanceTable(Table):
or raise an error if no corresponding embedding function is found.
If the `query` is a string, then the query type is "vector" if the
table has embedding functions, else the query type is "fts"
vector: list/np.ndarray, default None
The vector query for hybrid search.
text: str, default None
The text query for hybrid search.
Returns
-------
@@ -1264,11 +1324,17 @@ class LanceTable(Table):
and also the "_distance" column which is the distance between the query
vector and the returned vector.
"""
if vector_column_name is None and query is not None:
is_query_defined = query is not None or vector is not None or text is not None
if vector_column_name is None and is_query_defined:
vector_column_name = inf_vector_column_query(self.schema)
register_event("search_table")
return LanceQueryBuilder.create(
self, query, query_type, vector_column_name=vector_column_name
self,
query,
query_type,
vector_column_name=vector_column_name,
vector=vector,
text=text,
)
@classmethod
@@ -1536,6 +1602,22 @@ class LanceTable(Table):
"""
return self.to_lance().optimize.compact_files(*args, **kwargs)
def add_columns(self, transforms: Dict[str, str]):
self._dataset_mut.add_columns(transforms)
def alter_columns(self, *alterations: Iterable[Dict[str, str]]):
modified = []
# I called this name in pylance, but I think I regret that now. So we
# allow both name and rename.
for alter in alterations:
if "rename" in alter:
alter["name"] = alter.pop("rename")
modified.append(alter)
self._dataset_mut.alter_columns(*modified)
def drop_columns(self, columns: Iterable[str]):
self._dataset_mut.drop_columns(columns)
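# A hedged sketch of the schema-evolution calls above, assuming `tbl` is an
# open LanceTable with an "id" column (mirrors the new tests below).
tbl.add_columns({"new_col": "id + 2"})                 # SQL expression per new column
tbl.alter_columns({"path": "id", "rename": "new_id"})  # "rename" or "name" both accepted
tbl.drop_columns(["new_col"])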
def _sanitize_schema(
data: pa.Table,
@@ -1714,3 +1796,715 @@ def _sanitize_nans(data, fill_value, on_bad_vectors, vec_arr, vector_column_name
is_full = np.any(~is_value_nan.reshape(-1, vec_arr.type.list_size), axis=1)
data = data.filter(is_full)
return data
class AsyncTable(ABC):
"""
A Table is a collection of Records in a LanceDB Database.
Examples
--------
Create using [DBConnection.create_table][lancedb.DBConnection.create_table]
(more examples in that method's documentation).
>>> import lancedb
>>> db = lancedb.connect("./.lancedb")
>>> table = db.create_table("my_table", data=[{"vector": [1.1, 1.2], "b": 2}])
>>> table.head()
pyarrow.Table
vector: fixed_size_list<item: float>[2]
child 0, item: float
b: int64
----
vector: [[[1.1,1.2]]]
b: [[2]]
Can append new data with [Table.add()][lancedb.table.Table.add].
>>> table.add([{"vector": [0.5, 1.3], "b": 4}])
Can query the table with [Table.search][lancedb.table.Table.search].
>>> table.search([0.4, 0.4]).select(["b", "vector"]).to_pandas()
b vector _distance
0 4 [0.5, 1.3] 0.82
1 2 [1.1, 1.2] 1.13
Search queries are much faster when an index is created. See
[Table.create_index][lancedb.table.Table.create_index].
"""
@property
@abstractmethod
def name(self) -> str:
"""The name of the table."""
raise NotImplementedError
@abstractmethod
async def schema(self) -> pa.Schema:
"""The [Arrow Schema](https://arrow.apache.org/docs/python/api/datatypes.html#)
of this Table
"""
raise NotImplementedError
@abstractmethod
async def count_rows(self, filter: Optional[str] = None) -> int:
"""
Count the number of rows in the table.
Parameters
----------
filter: str, optional
A SQL where clause to filter the rows to count.
"""
raise NotImplementedError
async def to_pandas(self) -> "pd.DataFrame":
"""Return the table as a pandas DataFrame.
Returns
-------
pd.DataFrame
"""
return (await self.to_arrow()).to_pandas()
@abstractmethod
async def to_arrow(self) -> pa.Table:
"""Return the table as a pyarrow Table.
Returns
-------
pa.Table
"""
raise NotImplementedError
async def create_index(
self,
metric="L2",
num_partitions=256,
num_sub_vectors=96,
vector_column_name: str = VECTOR_COLUMN_NAME,
replace: bool = True,
accelerator: Optional[str] = None,
index_cache_size: Optional[int] = None,
):
"""Create an index on the table.
Parameters
----------
metric: str, default "L2"
The distance metric to use when creating the index.
Valid values are "L2", "cosine", or "dot".
L2 is euclidean distance.
num_partitions: int, default 256
The number of IVF partitions to use when creating the index.
Default is 256.
num_sub_vectors: int, default 96
The number of PQ sub-vectors to use when creating the index.
Default is 96.
vector_column_name: str, default "vector"
The vector column name to create the index.
replace: bool, default True
- If True, replace the existing index if it exists.
- If False, raise an error if duplicate index exists.
accelerator: str, default None
If set, use the given accelerator to create the index.
Only support "cuda" for now.
index_cache_size : int, optional
The size of the index cache in number of entries. Default value is 256.
"""
raise NotImplementedError
@abstractmethod
async def create_scalar_index(
self,
column: str,
*,
replace: bool = True,
):
"""Create a scalar index on a column.
Scalar indices, like vector indices, can be used to speed up scans. A scalar
index can speed up scans that contain filter expressions on the indexed column.
For example, the following scan will be faster if the column ``my_col`` has
a scalar index:
.. code-block:: python
import lancedb
db = lancedb.connect("/data/lance")
img_table = db.open_table("images")
my_df = img_table.search().where("my_col = 7", prefilter=True).to_pandas()
Scalar indices can also speed up scans containing a vector search and a
prefilter:
.. code-block:: python
import lancedb
db = lancedb.connect("/data/lance")
img_table = db.open_table("images")
img_table.search([1, 2, 3, 4], vector_column_name="vector")
.where("my_col != 7", prefilter=True)
.to_pandas()
Scalar indices can only speed up scans for basic filters using
equality, comparison, range (e.g. ``my_col BETWEEN 0 AND 100``), and set
membership (e.g. ``my_col IN (0, 1, 2)``).
Scalar indices can be used if the filter contains multiple indexed columns and
the filter criteria are AND'd or OR'd together
(e.g. ``my_col < 0 AND other_col > 100``).
Scalar indices may be used if the filter contains non-indexed columns but,
depending on the structure of the filter, they may not be usable. For example,
if the column ``not_indexed`` does not have a scalar index then the filter
``my_col = 0 OR not_indexed = 1`` will not be able to use any scalar index on
``my_col``.
**Experimental API**
Parameters
----------
column : str
The column to be indexed. Must be a boolean, integer, float,
or string column.
replace : bool, default True
Replace the existing index if it exists.
Examples
--------
.. code-block:: python
import lance
dataset = lance.dataset("./images.lance")
dataset.create_scalar_index("category")
"""
raise NotImplementedError
@abstractmethod
async def add(
self,
data: DATA,
mode: str = "append",
on_bad_vectors: str = "error",
fill_value: float = 0.0,
):
"""Add more data to the [Table](Table).
Parameters
----------
data: DATA
The data to insert into the table. Acceptable types are:
- dict or list-of-dict
- pandas.DataFrame
- pyarrow.Table or pyarrow.RecordBatch
mode: str
The mode to use when writing the data. Valid values are
"append" and "overwrite".
on_bad_vectors: str, default "error"
What to do if any of the vectors are not the same size or contain NaNs.
One of "error", "drop", "fill".
fill_value: float, default 0.
The value to use when filling vectors. Only used if on_bad_vectors="fill".
"""
raise NotImplementedError
def merge_insert(self, on: Union[str, Iterable[str]]) -> LanceMergeInsertBuilder:
"""
Returns a [`LanceMergeInsertBuilder`][lancedb.merge.LanceMergeInsertBuilder]
that can be used to create a "merge insert" operation
This operation can add rows, update rows, and remove rows all in a single
transaction. It is a very generic tool that can be used to create
behaviors like "insert if not exists", "update or insert (i.e. upsert)",
or even replace a portion of existing data with new data (e.g. replace
all data where month="january")
The merge insert operation works by combining new data from a
**source table** with existing data in a **target table** by using a
join. There are three categories of records.
"Matched" records are records that exist in both the source table and
the target table. "Not matched" records exist only in the source table
(e.g. these are new data). "Not matched by source" records exist only
in the target table (this is old data).
The builder returned by this method can be used to customize what
should happen for each category of data.
Please note that the data may appear to be reordered as part of this
operation. This is because updated rows will be deleted from the
dataset and then reinserted at the end with the new values.
Parameters
----------
on: Union[str, Iterable[str]]
A column (or columns) to join on. This is how records from the
source table and target table are matched. Typically this is some
kind of key or id column.
Examples
--------
>>> import lancedb
>>> data = pa.table({"a": [2, 1, 3], "b": ["a", "b", "c"]})
>>> db = lancedb.connect("./.lancedb")
>>> table = db.create_table("my_table", data)
>>> new_data = pa.table({"a": [2, 3, 4], "b": ["x", "y", "z"]})
>>> # Perform a "upsert" operation
>>> table.merge_insert("a") \\
... .when_matched_update_all() \\
... .when_not_matched_insert_all() \\
... .execute(new_data)
>>> # The order of new rows is non-deterministic since we use
>>> # a hash-join as part of this operation and so we sort here
>>> table.to_arrow().sort_by("a").to_pandas()
a b
0 1 b
1 2 x
2 3 y
3 4 z
"""
on = [on] if isinstance(on, str) else list(on)
return LanceMergeInsertBuilder(self, on)
@abstractmethod
async def search(
self,
query: Optional[Union[VEC, str, "PIL.Image.Image", Tuple]] = None,
vector_column_name: Optional[str] = None,
query_type: str = "auto",
) -> LanceQueryBuilder:
"""Create a search query to find the nearest neighbors
of the given query vector. We currently support [vector search][search]
and [full-text search][experimental-full-text-search].
All query options are defined in [Query][lancedb.query.Query].
Examples
--------
>>> import lancedb
>>> db = lancedb.connect("./.lancedb")
>>> data = [
... {"original_width": 100, "caption": "bar", "vector": [0.1, 2.3, 4.5]},
... {"original_width": 2000, "caption": "foo", "vector": [0.5, 3.4, 1.3]},
... {"original_width": 3000, "caption": "test", "vector": [0.3, 6.2, 2.6]}
... ]
>>> table = db.create_table("my_table", data)
>>> query = [0.4, 1.4, 2.4]
>>> (table.search(query)
... .where("original_width > 1000", prefilter=True)
... .select(["caption", "original_width", "vector"])
... .limit(2)
... .to_pandas())
caption original_width vector _distance
0 foo 2000 [0.5, 3.4, 1.3] 5.220000
1 test 3000 [0.3, 6.2, 2.6] 23.089996
Parameters
----------
query: list/np.ndarray/str/PIL.Image.Image, default None
The targeted vector to search for.
- *default None*.
Acceptable types are: list, np.ndarray, PIL.Image.Image
- If None then the select/where/limit clauses are applied to filter
the table
vector_column_name: str, optional
The name of the vector column to search.
The vector column needs to be a pyarrow fixed size list type
- If not specified then the vector column is inferred from
the table schema
- If the table has multiple vector columns then the *vector_column_name*
needs to be specified. Otherwise, an error is raised.
query_type: str
*default "auto"*.
Acceptable types are: "vector", "fts", "hybrid", or "auto"
- If "auto" then the query type is inferred from the query;
- If `query` is a list/np.ndarray then the query type is
"vector";
- If `query` is a PIL.Image.Image then either do vector search,
or raise an error if no corresponding embedding function is found.
- If `query` is a string, then the query type is "vector" if the
table has embedding functions, else the query type is "fts"
Returns
-------
LanceQueryBuilder
A query builder object representing the query.
Once executed, the query returns
- selected columns
- the vector
- and also the "_distance" column which is the distance between the query
vector and the returned vector.
"""
raise NotImplementedError
@abstractmethod
async def _execute_query(self, query: Query) -> pa.Table:
pass
@abstractmethod
async def _do_merge(
self,
merge: LanceMergeInsertBuilder,
new_data: DATA,
on_bad_vectors: str,
fill_value: float,
):
pass
@abstractmethod
async def delete(self, where: str):
"""Delete rows from the table.
This can be used to delete a single row, many rows, all rows, or
sometimes no rows (if your predicate matches nothing).
Parameters
----------
where: str
The SQL where clause to use when deleting rows.
- For example, 'x = 2' or 'x IN (1, 2, 3)'.
The filter must not be empty, or it will error.
Examples
--------
>>> import lancedb
>>> data = [
... {"x": 1, "vector": [1, 2]},
... {"x": 2, "vector": [3, 4]},
... {"x": 3, "vector": [5, 6]}
... ]
>>> db = lancedb.connect("./.lancedb")
>>> table = db.create_table("my_table", data)
>>> table.to_pandas()
x vector
0 1 [1.0, 2.0]
1 2 [3.0, 4.0]
2 3 [5.0, 6.0]
>>> table.delete("x = 2")
>>> table.to_pandas()
x vector
0 1 [1.0, 2.0]
1 3 [5.0, 6.0]
If you have a list of values to delete, you can combine them into a
stringified list and use the `IN` operator:
>>> to_remove = [1, 5]
>>> to_remove = ", ".join([str(v) for v in to_remove])
>>> to_remove
'1, 5'
>>> table.delete(f"x IN ({to_remove})")
>>> table.to_pandas()
x vector
0 3 [5.0, 6.0]
"""
raise NotImplementedError
@abstractmethod
async def update(
self,
where: Optional[str] = None,
values: Optional[dict] = None,
*,
values_sql: Optional[Dict[str, str]] = None,
):
"""
This can be used to update zero to all rows depending on how many
rows match the where clause. If no where clause is provided, then
all rows will be updated.
Either `values` or `values_sql` must be provided. You cannot provide
both.
Parameters
----------
where: str, optional
The SQL where clause to use when updating rows. For example, 'x = 2'
or 'x IN (1, 2, 3)'. The filter must not be empty, or it will error.
values: dict, optional
The values to update. The keys are the column names and the values
are the values to set.
values_sql: dict, optional
The values to update, expressed as SQL expression strings. These can
reference existing columns. For example, {"x": "x + 1"} will increment
the x column by 1.
Examples
--------
>>> import lancedb
>>> import pandas as pd
>>> data = pd.DataFrame({"x": [1, 2, 3], "vector": [[1, 2], [3, 4], [5, 6]]})
>>> db = lancedb.connect("./.lancedb")
>>> table = db.create_table("my_table", data)
>>> table.to_pandas()
x vector
0 1 [1.0, 2.0]
1 2 [3.0, 4.0]
2 3 [5.0, 6.0]
>>> table.update(where="x = 2", values={"vector": [10, 10]})
>>> table.to_pandas()
x vector
0 1 [1.0, 2.0]
1 3 [5.0, 6.0]
2 2 [10.0, 10.0]
>>> table.update(values_sql={"x": "x + 1"})
>>> table.to_pandas()
x vector
0 2 [1.0, 2.0]
1 4 [5.0, 6.0]
2 3 [10.0, 10.0]
"""
raise NotImplementedError
@abstractmethod
async def cleanup_old_versions(
self,
older_than: Optional[timedelta] = None,
*,
delete_unverified: bool = False,
) -> CleanupStats:
"""
Clean up old versions of the table, freeing disk space.
Note: This function is not available in LanceDB Cloud (since LanceDB
Cloud manages cleanup for you automatically)
Parameters
----------
older_than: timedelta, default None
The minimum age of the version to delete. If None, then this defaults
to two weeks.
delete_unverified: bool, default False
Because they may be part of an in-progress transaction, files newer
than 7 days old are not deleted by default. If you are sure that
there are no in-progress transactions, then you can set this to True
to delete all files older than `older_than`.
Returns
-------
CleanupStats
The stats of the cleanup operation, including how many bytes were
freed.
"""
raise NotImplementedError
@abstractmethod
async def compact_files(self, *args, **kwargs):
"""
Run the compaction process on the table.
Note: This function is not available in LanceDB Cloud (since LanceDB
Cloud manages compaction for you automatically)
This can be run after making several small appends to optimize the table
for faster reads.
Arguments are passed onto :meth:`lance.dataset.DatasetOptimizer.compact_files`.
For most cases, the default should be fine.
"""
raise NotImplementedError
@abstractmethod
async def add_columns(self, transforms: Dict[str, str]):
"""
Add new columns with defined values.
This is not yet available in LanceDB Cloud.
Parameters
----------
transforms: Dict[str, str]
A map of column name to a SQL expression to use to calculate the
value of the new column. These expressions will be evaluated for
each row in the table, and can reference existing columns.
"""
raise NotImplementedError
@abstractmethod
async def alter_columns(self, alterations: Iterable[Dict[str, str]]):
"""
Alter column names and nullability.
This is not yet available in LanceDB Cloud.
Parameters
----------
alterations : Iterable[Dict[str, Any]]
A sequence of dictionaries, each with the following keys:
- "path": str
The column path to alter. For a top-level column, this is the name.
For a nested column, this is the dot-separated path, e.g. "a.b.c".
- "name": str, optional
The new name of the column. If not specified, the column name is
not changed.
- "nullable": bool, optional
Whether the column should be nullable. If not specified, the column
nullability is not changed. Only non-nullable columns can be changed
to nullable. Currently, you cannot change a nullable column to
non-nullable.
"""
raise NotImplementedError
@abstractmethod
async def drop_columns(self, columns: Iterable[str]):
"""
Drop columns from the table.
This is not yet available in LanceDB Cloud.
Parameters
----------
columns : Iterable[str]
The names of the columns to drop.
"""
raise NotImplementedError
class AsyncLanceTable(AsyncTable):
def __init__(self, table: LanceDBTable):
self._inner = table
@property
@override
def name(self) -> str:
return self._inner.name()
@override
async def schema(self) -> pa.Schema:
return await self._inner.schema()
@override
async def count_rows(self, filter: Optional[str] = None) -> int:
raise NotImplementedError
async def to_pandas(self) -> "pd.DataFrame":
return (await self.to_arrow()).to_pandas()
@override
async def to_arrow(self) -> pa.Table:
raise NotImplementedError
async def create_index(
self,
metric="L2",
num_partitions=256,
num_sub_vectors=96,
vector_column_name: str = VECTOR_COLUMN_NAME,
replace: bool = True,
accelerator: Optional[str] = None,
index_cache_size: Optional[int] = None,
):
raise NotImplementedError
@override
async def create_scalar_index(
self,
column: str,
*,
replace: bool = True,
):
raise NotImplementedError
@override
async def add(
self,
data: DATA,
mode: str = "append",
on_bad_vectors: str = "error",
fill_value: float = 0.0,
):
raise NotImplementedError
def merge_insert(self, on: Union[str, Iterable[str]]) -> LanceMergeInsertBuilder:
on = [on] if isinstance(on, str) else list(on)
return LanceMergeInsertBuilder(self, on)
@override
async def search(
self,
query: Optional[Union[VEC, str, "PIL.Image.Image", Tuple]] = None,
vector_column_name: Optional[str] = None,
query_type: str = "auto",
) -> LanceQueryBuilder:
raise NotImplementedError
@override
async def _execute_query(self, query: Query) -> pa.Table:
pass
@override
async def _do_merge(
self,
merge: LanceMergeInsertBuilder,
new_data: DATA,
on_bad_vectors: str,
fill_value: float,
):
pass
@override
async def delete(self, where: str):
raise NotImplementedError
@override
async def update(
self,
where: Optional[str] = None,
values: Optional[dict] = None,
*,
values_sql: Optional[Dict[str, str]] = None,
):
raise NotImplementedError
@override
async def cleanup_old_versions(
self,
older_than: Optional[timedelta] = None,
*,
delete_unverified: bool = False,
) -> CleanupStats:
raise NotImplementedError
@override
async def compact_files(self, *args, **kwargs):
raise NotImplementedError
@override
async def add_columns(self, transforms: Dict[str, str]):
raise NotImplementedError
@override
async def alter_columns(self, alterations: Iterable[Dict[str, str]]):
raise NotImplementedError
@override
async def drop_columns(self, columns: Iterable[str]):
raise NotImplementedError

View File

@@ -1,5 +1,4 @@
from click.testing import CliRunner
from lancedb.cli.cli import cli
from lancedb.utils import CONFIG

View File

@@ -13,7 +13,6 @@
import pandas as pd
import pytest
from lancedb.context import contextualize

View File

@@ -11,12 +11,11 @@
# See the License for the specific language governing permissions and
# limitations under the License.
import lancedb
import numpy as np
import pandas as pd
import pyarrow as pa
import pytest
import lancedb
from lancedb.pydantic import LanceModel, Vector
@@ -166,6 +165,24 @@ def test_table_names(tmp_path):
assert db.table_names() == ["test1", "test2", "test3"]
@pytest.mark.asyncio
async def test_table_names_async(tmp_path):
db = lancedb.connect(tmp_path)
data = pd.DataFrame(
{
"vector": [[3.1, 4.1], [5.9, 26.5]],
"item": ["foo", "bar"],
"price": [10.0, 20.0],
}
)
db.create_table("test2", data=data)
db.create_table("test1", data=data)
db.create_table("test3", data=data)
db = await lancedb.connect_async(tmp_path)
assert await db.table_names() == ["test1", "test2", "test3"]
def test_create_mode(tmp_path):
db = lancedb.connect(tmp_path)
data = pd.DataFrame(
@@ -233,6 +250,78 @@ def test_create_exist_ok(tmp_path):
db.create_table("test", schema=bad_schema, exist_ok=True)
@pytest.mark.asyncio
async def test_create_mode_async(tmp_path):
db = await lancedb.connect_async(tmp_path)
data = pd.DataFrame(
{
"vector": [[3.1, 4.1], [5.9, 26.5]],
"item": ["foo", "bar"],
"price": [10.0, 20.0],
}
)
await db.create_table("test", data=data)
with pytest.raises(RuntimeError):
await db.create_table("test", data=data)
new_data = pd.DataFrame(
{
"vector": [[3.1, 4.1], [5.9, 26.5]],
"item": ["fizz", "buzz"],
"price": [10.0, 20.0],
}
)
_tbl = await db.create_table("test", data=new_data, mode="overwrite")
# MIGRATION: to_pandas() is not available in async
# assert tbl.to_pandas().item.tolist() == ["fizz", "buzz"]
@pytest.mark.asyncio
async def test_create_exist_ok_async(tmp_path):
db = await lancedb.connect_async(tmp_path)
data = pd.DataFrame(
{
"vector": [[3.1, 4.1], [5.9, 26.5]],
"item": ["foo", "bar"],
"price": [10.0, 20.0],
}
)
tbl = await db.create_table("test", data=data)
with pytest.raises(RuntimeError):
await db.create_table("test", data=data)
# open the table but don't add more rows
tbl2 = await db.create_table("test", data=data, exist_ok=True)
assert tbl.name == tbl2.name
assert await tbl.schema() == await tbl2.schema()
schema = pa.schema(
[
pa.field("vector", pa.list_(pa.float32(), list_size=2)),
pa.field("item", pa.utf8()),
pa.field("price", pa.float64()),
]
)
tbl3 = await db.create_table("test", schema=schema, exist_ok=True)
assert await tbl3.schema() == schema
# Migration: when creating a table that already exists with a different
# schema, it should raise an error.
# bad_schema = pa.schema(
# [
# pa.field("vector", pa.list_(pa.float32(), list_size=2)),
# pa.field("item", pa.utf8()),
# pa.field("price", pa.float64()),
# pa.field("extra", pa.float32()),
# ]
# )
# with pytest.raises(ValueError):
# await db.create_table("test", schema=bad_schema, exist_ok=True)
def test_delete_table(tmp_path):
db = lancedb.connect(tmp_path)
data = pd.DataFrame(

View File

@@ -13,7 +13,6 @@
import numpy as np
import pytest
from lancedb import LanceDBConnection
# TODO: setup integ test mark and script

View File

@@ -13,11 +13,10 @@
import sys
import lance
import lancedb
import numpy as np
import pyarrow as pa
import pytest
import lancedb
from lancedb.conftest import MockTextEmbeddingFunction
from lancedb.embeddings import (
EmbeddingFunctionConfig,

View File

@@ -14,12 +14,11 @@ import importlib
import io
import os
import lancedb
import numpy as np
import pandas as pd
import pytest
import requests
import lancedb
from lancedb.embeddings import get_registry
from lancedb.pydantic import LanceModel, Vector
@@ -185,10 +184,9 @@ def test_imagebind(tmp_path):
import shutil
import tempfile
import lancedb.embeddings.imagebind
import pandas as pd
import requests
import lancedb.embeddings.imagebind
from lancedb.embeddings import get_registry
from lancedb.pydantic import LanceModel, Vector

View File

@@ -14,13 +14,13 @@ import os
import random
from unittest import mock
import lancedb as ldb
import numpy as np
import pandas as pd
import pytest
import tantivy
import lancedb as ldb
import lancedb.fts
pytest.importorskip("lancedb.fts")
tantivy = pytest.importorskip("tantivy")
@pytest.fixture

View File

@@ -13,9 +13,8 @@
import os
import pytest
import lancedb
import pytest
# You need to set up AWS credentials and a base path to run this test. Example:
# AWS_PROFILE=default TEST_S3_BASE_URL=s3://my_bucket/dataset pytest tests/test_io.py

View File

@@ -20,9 +20,8 @@ from typing import List, Optional, Tuple
import pyarrow as pa
import pydantic
import pytest
from pydantic import Field
from lancedb.pydantic import PYDANTIC_VERSION, LanceModel, Vector, pydantic_to_schema
from pydantic import Field
@pytest.mark.skipif(

View File

@@ -18,7 +18,6 @@ import numpy as np
import pandas.testing as tm
import pyarrow as pa
import pytest
from lancedb.db import LanceDBConnection
from lancedb.pydantic import LanceModel, Vector
from lancedb.query import LanceVectorQueryBuilder, Query
@@ -88,13 +87,24 @@ def test_query_builder(table):
rs = (
LanceVectorQueryBuilder(table, [0, 0], "vector")
.limit(1)
.select(["id"])
.select(["id", "vector"])
.to_list()
)
assert rs[0]["id"] == 1
assert all(np.array(rs[0]["vector"]) == [1, 2])
def test_dynamic_projection(table):
rs = (
LanceVectorQueryBuilder(table, [0, 0], "vector")
.limit(1)
.select({"id": "id", "id2": "id * 2"})
.to_list()
)
assert rs[0]["id"] == 1
assert rs[0]["id2"] == 2
def test_query_builder_with_filter(table):
rs = LanceVectorQueryBuilder(table, [0, 0], "vector").where("id = 2").to_list()
assert rs[0]["id"] == 2

View File

@@ -17,7 +17,6 @@ import pandas as pd
import pyarrow as pa
import pytest
from aiohttp import web
from lancedb.remote.client import RestfulLanceDBClient, VectorQuery

View File

@@ -11,9 +11,8 @@
# See the License for the specific language governing permissions and
# limitations under the License.
import pyarrow as pa
import lancedb
import pyarrow as pa
from lancedb.remote.client import VectorQuery, VectorQueryResult

View File

@@ -1,9 +1,8 @@
import os
import lancedb
import numpy as np
import pytest
import lancedb
from lancedb.conftest import MockTextEmbeddingFunction # noqa
from lancedb.embeddings import EmbeddingFunctionRegistry
from lancedb.pydantic import LanceModel, Vector
@@ -15,6 +14,9 @@ from lancedb.rerankers import (
)
from lancedb.table import LanceTable
# Tests rely on FTS index
pytest.importorskip("lancedb.fts")
def get_test_table(tmp_path):
db = lancedb.connect(tmp_path)
@@ -102,7 +104,7 @@ def test_linear_combination(tmp_path):
query = "Our father who art in heaven"
query_vector = table.to_pandas()["vector"][0]
result = (
table.search((query_vector, query))
table.search(vector=query_vector, text=query, query_type="hybrid")
.limit(30)
.rerank(normalize="score")
.to_arrow()
@@ -116,6 +118,32 @@ def test_linear_combination(tmp_path):
"be descending."
)
# automatically deduce the query type
result = (
table.search(vector=query_vector, text=query)
.limit(30)
.rerank(normalize="score")
.to_arrow()
)
# wrong query type raises an error
with pytest.raises(ValueError):
table.search(vector=query_vector, text=query, query_type="vector").rerank(
normalize="score"
)
with pytest.raises(ValueError):
table.search(vector=query_vector, text=query, query_type="fts").rerank(
normalize="score"
)
# raise an error if only vector or text is provided
with pytest.raises(ValueError):
table.search(vector=query_vector).to_arrow()
with pytest.raises(ValueError):
table.search(text=query).to_arrow()
@pytest.mark.skipif(
os.environ.get("COHERE_API_KEY") is None, reason="COHERE_API_KEY not set"
@@ -139,7 +167,7 @@ def test_cohere_reranker(tmp_path):
query = "Our father who art in heaven"
query_vector = table.to_pandas()["vector"][0]
result = (
table.search((query_vector, query))
table.search(vector=query_vector, text=query)
.limit(30)
.rerank(reranker=CohereReranker())
.to_arrow()
@@ -173,7 +201,7 @@ def test_cross_encoder_reranker(tmp_path):
query = "Our father who art in heaven"
query_vector = table.to_pandas()["vector"][0]
result = (
table.search((query_vector, query), query_type="hybrid")
table.search(vector=query_vector, text=query, query_type="hybrid")
.limit(30)
.rerank(reranker=CrossEncoderReranker())
.to_arrow()
@@ -207,7 +235,7 @@ def test_colbert_reranker(tmp_path):
query = "Our father who art in heaven"
query_vector = table.to_pandas()["vector"][0]
result = (
table.search((query_vector, query))
table.search(vector=query_vector, text=query)
.limit(30)
.rerank(reranker=ColbertReranker())
.to_arrow()
@@ -244,7 +272,7 @@ def test_openai_reranker(tmp_path):
query = "Our father who art in heaven"
query_vector = table.to_pandas()["vector"][0]
result = (
table.search((query_vector, query))
table.search(vector=query_vector, text=query)
.limit(30)
.rerank(reranker=OpenaiReranker())
.to_arrow()

View File

@@ -20,19 +20,18 @@ from typing import List
from unittest.mock import PropertyMock, patch
import lance
import lancedb
import numpy as np
import pandas as pd
import polars as pl
import pyarrow as pa
import pytest
from pydantic import BaseModel
import lancedb
from lancedb.conftest import MockTextEmbeddingFunction
from lancedb.db import LanceDBConnection
from lancedb.embeddings import EmbeddingFunctionConfig, EmbeddingFunctionRegistry
from lancedb.pydantic import LanceModel, Vector
from lancedb.table import LanceTable
from pydantic import BaseModel
class MockDB:
@@ -804,6 +803,9 @@ def test_count_rows(db):
def test_hybrid_search(db, tmp_path):
# This test uses an FTS index
pytest.importorskip("lancedb.fts")
db = MockDB(str(tmp_path))
# Create a LanceDB table schema with a vector and a text column
emb = EmbeddingFunctionRegistry.get_instance().get("test")()
@@ -898,3 +900,29 @@ def test_restore_consistency(tmp_path):
table.add([{"id": 2}])
assert table_fixed.version == table.version - 1
assert table_ref_latest.version == table.version
# Schema evolution
def test_add_columns(tmp_path):
db = lancedb.connect(tmp_path)
data = pa.table({"id": [0, 1]})
table = LanceTable.create(db, "my_table", data=data)
table.add_columns({"new_col": "id + 2"})
assert table.to_arrow().column_names == ["id", "new_col"]
assert table.to_arrow()["new_col"].to_pylist() == [2, 3]
def test_alter_columns(tmp_path):
db = lancedb.connect(tmp_path)
data = pa.table({"id": [0, 1]})
table = LanceTable.create(db, "my_table", data=data)
table.alter_columns({"path": "id", "rename": "new_id"})
assert table.to_arrow().column_names == ["new_id"]
def test_drop_columns(tmp_path):
db = lancedb.connect(tmp_path)
data = pa.table({"id": [0, 1], "category": ["a", "b"]})
table = LanceTable.create(db, "my_table", data=data)
table.drop_columns(["category"])
assert table.to_arrow().column_names == ["id"]

View File

@@ -1,8 +1,7 @@
import json
import pytest
import lancedb
import pytest
from lancedb.utils.events import _Events

View File

@@ -15,7 +15,6 @@ import os
import pathlib
import pytest
from lancedb.util import get_uri_scheme, join_uri

View File

@@ -1,17 +0,0 @@
# Copyright 2023 LanceDB Developers
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import setuptools
if __name__ == "__main__":
setuptools.setup()

Some files were not shown because too many files have changed in this diff.