Commit Graph

78 Commits

Author SHA1 Message Date
Bert
6eb662de9b fix: python remote correct open_table error message (#659) 2023-11-24 19:28:33 -05:00
Rok Mihevc
d8e3e54226 feat(python): expose index cache size (#655)
This is to enable https://github.com/lancedb/lancedb/issues/641.
Should be merged after https://github.com/lancedb/lance/pull/1587 is
released.
2023-11-18 14:17:40 -08:00
Ayush Chaurasia
1e8678f11a Multi-task instructor model with quantization support & weak_lru cache for embedding function models (#612)
resolves #608
2023-11-09 12:34:18 +05:30
Lei Xu
554e068917 chore: improve create_table API consistency between local and remote SDK (#627) 2023-11-03 13:15:11 -07:00
Ayush Chaurasia
1589499f89 Exponential standoff retry support for handling rate limited embedding functions (#614)
Users ingesting data using rate limited apis don't need to manually make
the process sleep for counter rate limits
resolves #579
2023-11-02 19:20:10 +05:30
Bert
24111d543a fix!: sort table names (#619)
https://github.com/lancedb/lance/issues/1385
2023-11-01 10:50:09 -04:00
Weston Pace
b517134309 feat: allow prefiltering with index (#610)
Support for prefiltering with an index was added in lance version 0.8.7.
We can remove the lancedb check that prevents this. Closes #261
2023-10-31 13:11:03 -07:00
Chang She
cd9debc3b7 fix(python): fix multiple embedding functions bug (#597)
Closes #594

The embedding functions are pydantic models so multiple instances with
the same parameters are considered ==, which means that if you have
multiple embedding columns it's possible for the embeddings to get
overwritten. Instead we use `is` instead of == to avoid this problem.

testing: modified unit test to include this case
2023-10-24 13:05:05 -04:00
Ayush Chaurasia
0293bbe142 [Python]Embeddings API refactor (#580)
Sets things up for this -> https://github.com/lancedb/lancedb/issues/579
- Just separates out the registry/ingestion code from the function
implementation code
- adds a `get_registry` util
- package name "open-clip" -> "open-clip-torch"
2023-10-17 22:32:19 -07:00
Prashanth Rao
86efb11572 Add pyarrow date and timestamp type conversion from pydantic (#576) 2023-10-16 19:42:24 -07:00
Lei Xu
eff94ecea8 chore: bump lance to 0.8.5 (#561)
Bump lance to 0.5.8
2023-10-14 12:38:43 -07:00
Ayush Chaurasia
683824f1e9 Add cohere embedding function (#550) 2023-10-13 16:27:34 +05:30
Will Jones
db7bdefe77 feat: cleanup and compaction (#518)
#488
2023-10-11 12:49:12 -07:00
Chang She
e1ae2bcbd8 feat: add to_list and to_pandas api's (#556)
Add `to_list` to return query results as list of python dict (so we're
not too pandas-centric). Closes #555

Add `to_pandas` API and add deprecation warning on `to_df`. Closes #545

Co-authored-by: Chang She <chang@lancedb.com>
2023-10-11 12:18:55 -07:00
Ayush Chaurasia
a1377afcaa feat: telemetry, error tracking, CLI & config manager (#538)
Co-authored-by: Lance Release <lance-dev@lancedb.com>
Co-authored-by: Rob Meng <rob.xu.meng@gmail.com>
Co-authored-by: Will Jones <willjones127@gmail.com>
Co-authored-by: Chang She <759245+changhiskhan@users.noreply.github.com>
Co-authored-by: rmeng <rob@lancedb.com>
Co-authored-by: Chang She <chang@lancedb.com>
Co-authored-by: Rok Mihevc <rok@mihevc.org>
2023-10-08 23:11:39 +05:30
Lei Xu
a26c8f3316 feat: use GPU for index creation. (#540)
Bump lance to 0.8.3 to include GPU training

---------

Co-authored-by: Rob Meng <rob.xu.meng@gmail.com>
2023-10-05 20:49:00 -07:00
Chang She
693bca1eba feat(python): expose prefilter to lancedb (#522)
We have experimental support for prefiltering (without ANN) in pylance.
This means that we can now apply a filter BEFORE vector search is
performed. This can be done via the `.where(filter_string,
prefilter=True)` kwargs of the query.

Limitations:
- When connecting to LanceDB cloud, `prefilter=True` will raise
NotImplemented
- When an ANN index is present, `prefilter=True` will raise
NotImplemented
- This option is not available for full text search query
- This option is not available for empty search query (just
filter/project)

Additional changes in this PR:
- Bump pylance version to v0.8.0 which supports the experimental
prefiltering.

---------

Co-authored-by: Chang She <chang@lancedb.com>
2023-10-01 10:34:12 -07:00
Rob Meng
a695fb8030 fix import attr to use import attrs (#510)
Thanks to #508, I used `attr` instead of the correct package `attrs`

s/attr/attrs
2023-09-23 00:30:56 -04:00
Chang She
c21f9cdda0 ci: fix docs build (#496)
python/python.md contains typos in the class references

---------

Co-authored-by: Chang She <chang@lancedb.com>
2023-09-18 13:07:21 -07:00
Chang She
31dad71c94 multi-modal embedding-function (#484) 2023-09-16 21:23:51 -04:00
Lei Xu
b315ea3978 [Python] Pydantic vector field with default value (#474)
Rename `lance.pydantic.vector` to `Vector` and deprecate `vector(dim)`
2023-09-08 22:35:31 -07:00
Ayush Chaurasia
aa7806cf0d [Python]Fix record_batch_generator (#483)
Should fix - https://github.com/lancedb/lancedb/issues/482
2023-09-08 21:18:50 +05:30
Chang She
9a9a73a65d [python] Use pydantic for embedding function persistence (#467)
1. Support persistent embedding function so users can just search using
query string
2. Add fixed size list conversion for multiple vector columns
3. Add support for empty query (just apply select/where/limit).
4. Refactor and simplify some of the data prep code

---------

Co-authored-by: Chang She <chang@lancedb.com>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
2023-09-05 21:30:45 -07:00
Chang She
0cba0f4f92 [python] Temporary update feature (#457)
Combine delete and append to make a temporary update feature that is
only enabled for the local python lancedb.

The reason why this is temporary is because it first has to load the
data that matches the where clause into memory, which is technical
unbounded.

---------

Co-authored-by: Chang She <chang@lancedb.com>
2023-08-30 00:25:26 -07:00
Chang She
e587a17a64 [python] Support schema evolution in local LanceDB (#452)
Previously if you needed to add a column to a table you'd have to
rewrite the whole table. Instead,
we use the merge functionality from Lance format
to incrementally add columns from another table
or dataframe.

---------

Co-authored-by: Chang She <chang@lancedb.com>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
2023-08-24 14:40:49 -07:00
Chang She
2f1f9f6338 [python] improve restore functionality (#451)
Previously the temporary restore feature required copying data. The new
feature in pylance does not.

---------

Co-authored-by: Chang She <chang@lancedb.com>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
2023-08-24 11:00:34 -07:00
Ayush Chaurasia
0b9924b432 Make creating (and adding to) tables via Iterators more flexible & intuitive (#430)
It improves the UX as iterators can be of any type supported by the
table (plus recordbatch) & there is no separate requirement.
Also expands the test cases for pydantic & arrow schema.
If this is looks good I'll update the docs.

Example usage:
```
class Content(LanceModel):
    vector: vector(2)
    item: str
    price: float

def make_batches():
    for _ in range(5):
        yield from [ 
        # pandas
        pd.DataFrame({
            "vector": [[3.1, 4.1], [1, 1]],
            "item": ["foo", "bar"],
            "price": [10.0, 20.0],
        }),
        
        # pylist
        [
            {"vector": [3.1, 4.1], "item": "foo", "price": 10.0},
            {"vector": [5.9, 26.5], "item": "bar", "price": 20.0},
        ],

        # recordbatch
        pa.RecordBatch.from_arrays(
            [
                pa.array([[3.1, 4.1], [5.9, 26.5]], pa.list_(pa.float32(), 2)),
                pa.array(["foo", "bar"]),
                pa.array([10.0, 20.0]),
            ], 
            ["vector", "item", "price"],
        ),

        # pydantic list
        [
            Content(vector=[3.1, 4.1], item="foo", price=10.0),
            Content(vector=[5.9, 26.5], item="bar", price=20.0),
        ]]

db = lancedb.connect("db")
tbl = db.create_table("tabley", make_batches(), schema=Content, mode="overwrite")

tbl.add(make_batches())
```
Same should with arrow schema.

---------

Co-authored-by: Weston Pace <weston.pace@gmail.com>
2023-08-18 09:56:30 +05:30
Chang She
e3061d4cb4 [python] Temporary restore feature (#428)
This adds LanceTable.restore as a temporary feature. It reads data from
a previous version and creates
a new snapshot version using that data. This makes the version writeable
unlike checkout. This should be replaced once the feature is implemented
in pylance.

Co-authored-by: Chang She <chang@lancedb.com>
2023-08-14 20:10:29 -07:00
Will Jones
722462c38b chore: upgrade Lance and rename score to _distance (#398)
BREAKING CHANGE: The `score` column has been renamed to `_distance` to
more accurately describe the semantics (smaller means closer / better).

---------

Co-authored-by: Lei Xu <lei@lancedb.com>
2023-08-11 21:42:33 -07:00
Ashis Kumar Naik
902a402951 implementation of drop_database (#418)
#416 Fixed.

added drop_database() method . This deletes all the tables from the
database with a single command.

---------

Signed-off-by: Ashis Kumar Naik <ashishami2002@gmail.com>
2023-08-11 20:59:56 -07:00
Chang She
a54d1e5618 Automatically convert pydantic model (#400)
Saves users from having to explicitly call
`LanceModel.to_arrow_schema()` when creating an empty table.
See new docs for full details.

---------

Co-authored-by: Chang She <chang@lancedb.com>
2023-08-06 14:50:03 -07:00
Ayush Chaurasia
bbfadfe58d [python] Allow adding via iterators (#391)
Makes the following work so all the formats accepted by `create_table()`
are also accepted by `add()`
```
import lancedb
import pyarrow as pa

db = lancedb.connect("/tmp")

def make_batches():
    for i in range(5):
        yield pa.RecordBatch.from_arrays(
            [
                pa.array([[3.1, 4.1], [5.9, 26.5]]),
                pa.array(["foo", "bar"]),
                pa.array([10.0, 20.0]),
            ],
            ["vector", "item", "price"],
        )

schema = pa.schema([
    pa.field("vector", pa.list_(pa.float32())),
    pa.field("item", pa.utf8()),
    pa.field("price", pa.float32()),
])

tbl = db.create_table("table4", make_batches(), schema=schema)
tbl.add(make_batches())
```
2023-08-04 12:49:44 -07:00
Chang She
cada35d5b7 Improve pydantic integration (#384) 2023-07-31 12:16:44 -04:00
Chang She
2d25c263e9 Implement drop table if exists (#383) 2023-07-31 10:25:09 +02:00
Lei Xu
63acdc2069 [Python] Support pydantic v1 as well (#337)
Support both Pydantic v1 and v2 (breaking changes)
2023-07-18 19:53:09 -07:00
Lei Xu
f09db4a6d6 [Python] Do not return Table count for every add operation (#328)
`Table::count()` will be linearly slower with more fragments ingested.
2023-07-18 17:11:17 -07:00
Lei Xu
088e745e1d [Python] Create table with Iterator[RecordBatch] and add docs (#316) 2023-07-16 21:45:55 -07:00
Lei Xu
028a6e433d [Python] Get table schema (#313) 2023-07-15 17:39:37 -07:00
Chang She
2fdcb307eb [python] Fix a few minor bugs (#304) 2023-07-15 03:47:42 +08:00
Lei Xu
08944bf4fd [Python] Convert Pydantic Model to Arrow Schema (#291)
Provide utility to automatically convert Pydantic model to Arrow Schema

Closes #256
2023-07-13 11:16:37 -07:00
Lei Xu
0e7ae5dfbf [Python] Fix list type conversion to JSON and temporal types (#274) 2023-07-11 11:05:51 -07:00
Lei Xu
9f603f73a9 [Python] Schema to JSON (#272) 2023-07-10 18:11:24 -07:00
Lei Xu
e6c6da6104 [Python] Initial support of cloud API (#260)
Support connect with remote database, and implement Search API
2023-07-07 15:41:15 -07:00
Chang She
e2325c634b Allow creation of an empty table (#254)
It's inconvenient to always require data at table creation time.
Here we enable you to create an empty table and add data and set schema
later.

---------

Co-authored-by: Chang She <chang@lancedb.com>
2023-07-06 20:44:58 -07:00
Chang She
507eeae9c8 Set default to error instead of drop (#259)
when encountering bad input data, we can default to principle of least
surprise and raise an exception.

Co-authored-by: Chang She <chang@lancedb.com>
2023-07-05 22:44:18 -07:00
Chang She
3c46d7f268 Handle NaN input data (#241)
Sometimes LangChain would insert a single `[np.nan]` as a placeholder if
the embedding function failed. This causes a problem for Lance format
because then the array can't be stored as a FixedSizedListArray.

Instead:
1. By default we remove rows with embedding lengths less than the
maximum length in the batch
2. If `strict=True` kwargs is set to True, then a `ValueError` is raised
if the embeddings aren't all the same length

---------

Co-authored-by: Chang She <chang@lancedb.com>
2023-07-04 20:00:46 -07:00
Leon Yee
eb5bcda337 Error implementations (#232)
Solves #216 by adding a check on table open for existence of the
`.lance` file. Does not check for it for remote connections.
2023-06-27 16:48:31 -07:00
Lei Xu
4bc676e26a [Python] Support replace during create_index (#233)
Closes #214
2023-06-27 16:02:07 -07:00
Philip Kung
313e66c4c5 Specify and Index Column for Vector Search (#217) 2023-06-26 16:11:08 -07:00
Rob Meng
d1e8a97a2a isort entire repo (#200) 2023-06-15 20:12:10 -04:00