fix(python): Add python azure blob read support (#1102)

I know there's a larger effort to have the python client based on the
core rust implementation, but in the meantime there have been several
issues (#1072 and #485) with some of the azure blob storage calls due to
pyarrow not natively supporting an azure backend. To this end, I've
added an optional import of the fsspec implementation of azure blob
storage [`adlfs`](https://pypi.org/project/adlfs/) and passed it to
`pyarrow.fs`. I've modified the existing test and manually verified it
with some real credentials to make sure it behaves as expected.

It should be now as simple as:

```python
import lancedb

db = lancedb.connect("az://blob_name/path")
table = db.open_table("test")
table.search(...)
```

Thank you for this cool project and we're excited to start using this
for real shortly! 🎉 And thanks to @dwhitena for bringing it to my
attention with his prediction guard posts.

Co-authored-by: christiandilorenzo <christian.dilorenzo@infiniaml.com>
This commit is contained in:
Christian Di Lorenzo
2024-03-15 17:15:41 -04:00
committed by Weston Pace
parent 1ea0c33545
commit 8bb983bc3d
3 changed files with 49 additions and 4 deletions

View File

@@ -16,16 +16,35 @@ import os
import lancedb
import pytest
# AWS:
# You need to setup AWS credentials an a base path to run this test. Example
# AWS_PROFILE=default TEST_S3_BASE_URL=s3://my_bucket/dataset pytest tests/test_io.py
#
# Azure:
# You need to setup Azure credentials an a base path to run this test. Example
# export AZURE_STORAGE_ACCOUNT_NAME="<account>"
# export AZURE_STORAGE_ACCOUNT_KEY="<key>"
# export REMOTE_BASE_URL=az://my_blob/dataset
# pytest tests/test_io.py
@pytest.fixture(autouse=True, scope="module")
def setup():
yield
if remote_url := os.environ.get("REMOTE_BASE_URL"):
db = lancedb.connect(remote_url)
for table in db.table_names():
db.drop_table(table)
@pytest.mark.skipif(
(os.environ.get("TEST_S3_BASE_URL") is None),
reason="please setup s3 base url",
(os.environ.get("REMOTE_BASE_URL") is None),
reason="please setup remote base url",
)
def test_s3_io():
db = lancedb.connect(os.environ.get("TEST_S3_BASE_URL"))
def test_remote_io():
db = lancedb.connect(os.environ.get("REMOTE_BASE_URL"))
assert db.table_names() == []
table = db.create_table(