lancedb

mirror of https://github.com/lancedb/lancedb.git synced 2026-07-03 19:10:41 +00:00

Author	SHA1	Message	Date
Chang She	7e75e50d3a	chore(python): update embedding API to use openai 1.6.1 (#751 ) API has changed significantly, namely `openai.Embedding.create` no longer exists. https://github.com/openai/openai-python/discussions/742 Update the OpenAI embedding function and put a minimum on the openai sdk version.	2023-12-28 15:05:57 -08:00
Chang She	4b8af261a3	feat: add timezone handling for datetime in pydantic (#578 ) If you add timezone information in the Field annotation for a datetime then that will now be passed to the pyarrow data type. I'm not sure how pyarrow enforces timezones; right now, it silently coerces to the timezone given in the column regardless of whether the input had the matching timezone or not. This is probably not the right behavior, though we could just make it so the user has to make the pydantic model do the validation instead of doing that at the pyarrow conversion layer.	2023-12-28 11:02:56 -08:00
Chang She	c8728d4ca1	feat(python): add post filtering for full text search (#739 ) Closes #721 fts will return results as a pyarrow table. Pyarrow tables have a `filter` method but it does not take sql filter strings (only pyarrow compute expressions). Instead, we do one of two things to support `tbl.search("keywords").where("foo=5").limit(10).to_arrow()`: Default path: If duckdb is available then use duckdb to execute the sql filter string on the pyarrow table. Backup path: Otherwise, write the pyarrow table to a lance dataset and then do `to_table(filter=<filter>)` Neither is ideal. Default path has two issues: 1. requires installing an extra library (duckdb) 2. duckdb mangles some fields (like fixed size list => list) Backup path incurs a latency penalty (~20ms on ssd) to write the result set to disk. In the short term, once #676 is addressed, we can write the dataset to "memory://" instead of disk, which makes the post filter evaluate much quicker (ETA next week). In the longer term, we'd like to be able to evaluate the filter string on the pyarrow Table directly, one possibility being that we use Substrait to generate pyarrow compute expressions from sql string. Or if there's enough progress on pyarrow, it could support Substrait expressions directly (no ETA) --------- Co-authored-by: Will Jones <willjones127@gmail.com>	2023-12-27 09:31:04 -08:00
Chang She	8f9ad978f5	feat(python): support list of list fields from pydantic schema (#747 ) For object detection, each row may correspond to an image and each image can have multiple bounding boxes of x-y coordinates. This means that a `bbox` field is potentially "list of list of float". This adds support in our pydantic-pyarrow conversion for nested lists.	2023-12-27 09:10:09 -08:00
Weston Pace	dc5126d8d1	feat: add the ability to create scalar indices (#679 ) This is a pretty direct binding to the underlying lance capability	2023-12-21 09:50:10 -08:00
Chang She	7bbb2872de	bug(python): fix path handling in windows (#724 ) Use pathlib for local paths so that pathlib can handle the correct separator on windows. Closes #703 --------- Co-authored-by: Will Jones <willjones127@gmail.com>	2023-12-20 15:41:36 -08:00
Will Jones	f9dd7a5d8a	fix: prevent duplicate data in FTS index (#728 ) This forces the user to replace the whole FTS directory when re-creating the index, preventing duplicate data from being created. Previously, the whole dataset was re-added to the existing index, duplicating existing rows in the index. This (in combination with lancedb/lance#1707) caused #726, since the duplicate data emitted duplicate indices for `take()` and an upstream issue caused those queries to fail. This solution isn't ideal, since it makes the FTS index temporarily unavailable while the index is built. In the future, we should have multiple FTS index directories, which would allow atomic commits of new indexes (as well as multiple indexes for different columns). Fixes #498. Fixes #726. --------- Co-authored-by: Chang She <759245+changhiskhan@users.noreply.github.com>	2023-12-20 13:07:07 -08:00
Will Jones	1d4943688d	upgrade lance to v0.9.1 (#727 ) This brings in some important bugfixes related to take and aarch64 Linux. See changes at: https://github.com/lancedb/lance/releases/tag/v0.9.1	2023-12-20 13:06:54 -08:00
Chang She	7856a94d2c	feat(python): support nested reference for fts (#723 ) https://github.com/lancedb/lance/issues/1739 Support nested field reference in full text search --------- Co-authored-by: Will Jones <willjones127@gmail.com>	2023-12-20 12:28:53 -08:00
Chang She	371d2f979e	feat(python): add option to flatten output in to_pandas (#722 ) Closes https://github.com/lancedb/lance/issues/1738 We add a `flatten` parameter to the signature of `to_pandas`. By default this is None and does nothing. If set to True or -1, then LanceDB will flatten structs before converting to a pandas dataframe. All nested structs are also flattened. If set to any positive integer, then LanceDB will flatten structs up to the specified level of nesting. --------- Co-authored-by: Weston Pace <weston.pace@gmail.com>	2023-12-20 12:23:07 -08:00
Chang She	bd0034a157	feat: support nested pydantic schema (#707 )	2023-12-14 18:20:45 -08:00
Will Jones	d087e7891d	feat(python): add update query support for Python (#654 ) Closes #69 Will not pass until https://github.com/lancedb/lance/pull/1585 is released	2023-12-14 11:28:32 -08:00
Bert	6eb662de9b	fix: python remote correct open_table error message (#659 )	2023-11-24 19:28:33 -05:00
Rok Mihevc	d8e3e54226	feat(python): expose index cache size (#655 ) This is to enable https://github.com/lancedb/lancedb/issues/641. Should be merged after https://github.com/lancedb/lance/pull/1587 is released.	2023-11-18 14:17:40 -08:00
Ayush Chaurasia	1e8678f11a	Multi-task instructor model with quantization support & weak_lru cache for embedding function models (#612 ) resolves #608	2023-11-09 12:34:18 +05:30
Lei Xu	554e068917	chore: improve create_table API consistency between local and remote SDK (#627 )	2023-11-03 13:15:11 -07:00
Ayush Chaurasia	1589499f89	Exponential standoff retry support for handling rate limited embedding functions (#614 ) Users ingesting data using rate limited apis don't need to manually make the process sleep to counter rate limits. Resolves #579	2023-11-02 19:20:10 +05:30
Bert	24111d543a	fix!: sort table names (#619 ) https://github.com/lancedb/lance/issues/1385	2023-11-01 10:50:09 -04:00
Weston Pace	b517134309	feat: allow prefiltering with index (#610 ) Support for prefiltering with an index was added in lance version 0.8.7. We can remove the lancedb check that prevents this. Closes #261	2023-10-31 13:11:03 -07:00
Chang She	cd9debc3b7	fix(python): fix multiple embedding functions bug (#597 ) Closes #594 The embedding functions are pydantic models so multiple instances with the same parameters are considered `==`, which means that if you have multiple embedding columns it's possible for the embeddings to get overwritten. We use `is` instead of `==` to avoid this problem. testing: modified unit test to include this case	2023-10-24 13:05:05 -04:00
Ayush Chaurasia	0293bbe142	[Python]Embeddings API refactor (#580 ) Sets things up for this -> https://github.com/lancedb/lancedb/issues/579 - Just separates out the registry/ingestion code from the function implementation code - adds a `get_registry` util - package name "open-clip" -> "open-clip-torch"	2023-10-17 22:32:19 -07:00
Prashanth Rao	86efb11572	Add pyarrow date and timestamp type conversion from pydantic (#576 )	2023-10-16 19:42:24 -07:00
Lei Xu	eff94ecea8	chore: bump lance to 0.8.5 (#561 ) Bump lance to 0.8.5	2023-10-14 12:38:43 -07:00
Ayush Chaurasia	683824f1e9	Add cohere embedding function (#550 )	2023-10-13 16:27:34 +05:30
Will Jones	db7bdefe77	feat: cleanup and compaction (#518 ) #488	2023-10-11 12:49:12 -07:00
Chang She	e1ae2bcbd8	feat: add to_list and to_pandas api's (#556 ) Add `to_list` to return query results as list of python dict (so we're not too pandas-centric). Closes #555 Add `to_pandas` API and add deprecation warning on `to_df`. Closes #545 Co-authored-by: Chang She <chang@lancedb.com>	2023-10-11 12:18:55 -07:00
Ayush Chaurasia	a1377afcaa	feat: telemetry, error tracking, CLI & config manager (#538 ) Co-authored-by: Lance Release <lance-dev@lancedb.com> Co-authored-by: Rob Meng <rob.xu.meng@gmail.com> Co-authored-by: Will Jones <willjones127@gmail.com> Co-authored-by: Chang She <759245+changhiskhan@users.noreply.github.com> Co-authored-by: rmeng <rob@lancedb.com> Co-authored-by: Chang She <chang@lancedb.com> Co-authored-by: Rok Mihevc <rok@mihevc.org>	2023-10-08 23:11:39 +05:30
Lei Xu	a26c8f3316	feat: use GPU for index creation. (#540 ) Bump lance to 0.8.3 to include GPU training --------- Co-authored-by: Rob Meng <rob.xu.meng@gmail.com>	2023-10-05 20:49:00 -07:00
Chang She	693bca1eba	feat(python): expose prefilter to lancedb (#522 ) We have experimental support for prefiltering (without ANN) in pylance. This means that we can now apply a filter BEFORE vector search is performed. This can be done via the `.where(filter_string, prefilter=True)` kwargs of the query. Limitations: - When connecting to LanceDB cloud, `prefilter=True` will raise NotImplemented - When an ANN index is present, `prefilter=True` will raise NotImplemented - This option is not available for full text search query - This option is not available for empty search query (just filter/project) Additional changes in this PR: - Bump pylance version to v0.8.0 which supports the experimental prefiltering. --------- Co-authored-by: Chang She <chang@lancedb.com>	2023-10-01 10:34:12 -07:00
Rob Meng	a695fb8030	fix `import attr` to use `import attrs` (#510 ) Thanks to #508, I used `attr` instead of the correct package `attrs` s/attr/attrs	2023-09-23 00:30:56 -04:00
Chang She	c21f9cdda0	ci: fix docs build (#496 ) python/python.md contains typos in the class references --------- Co-authored-by: Chang She <chang@lancedb.com>	2023-09-18 13:07:21 -07:00
Chang She	31dad71c94	multi-modal embedding-function (#484 )	2023-09-16 21:23:51 -04:00
Lei Xu	b315ea3978	[Python] Pydantic vector field with default value (#474 ) Rename `lance.pydantic.vector` to `Vector` and deprecate `vector(dim)`	2023-09-08 22:35:31 -07:00
Ayush Chaurasia	aa7806cf0d	[Python]Fix record_batch_generator (#483 ) Should fix - https://github.com/lancedb/lancedb/issues/482	2023-09-08 21:18:50 +05:30
Chang She	9a9a73a65d	[python] Use pydantic for embedding function persistence (#467 ) 1. Support persistent embedding function so users can just search using query string 2. Add fixed size list conversion for multiple vector columns 3. Add support for empty query (just apply select/where/limit). 4. Refactor and simplify some of the data prep code --------- Co-authored-by: Chang She <chang@lancedb.com> Co-authored-by: Weston Pace <weston.pace@gmail.com>	2023-09-05 21:30:45 -07:00
Chang She	0cba0f4f92	[python] Temporary update feature (#457 ) Combine delete and append to make a temporary update feature that is only enabled for the local python lancedb. This is temporary because it first has to load the data that matches the where clause into memory, which is technically unbounded. --------- Co-authored-by: Chang She <chang@lancedb.com>	2023-08-30 00:25:26 -07:00
Chang She	e587a17a64	[python] Support schema evolution in local LanceDB (#452 ) Previously if you needed to add a column to a table you'd have to rewrite the whole table. Instead, we use the merge functionality from Lance format to incrementally add columns from another table or dataframe. --------- Co-authored-by: Chang She <chang@lancedb.com> Co-authored-by: Weston Pace <weston.pace@gmail.com>	2023-08-24 14:40:49 -07:00
Chang She	2f1f9f6338	[python] improve restore functionality (#451 ) Previously the temporary restore feature required copying data. The new feature in pylance does not. --------- Co-authored-by: Chang She <chang@lancedb.com> Co-authored-by: Weston Pace <weston.pace@gmail.com>	2023-08-24 11:00:34 -07:00
Ayush Chaurasia	0b9924b432	Make creating (and adding to) tables via Iterators more flexible & intuitive (#430 ) It improves the UX as iterators can be of any type supported by the table (plus recordbatch) & there is no separate requirement. Also expands the test cases for pydantic & arrow schema. If this looks good I'll update the docs. Example usage: ``` class Content(LanceModel): vector: vector(2) item: str price: float def make_batches(): for _ in range(5): yield from [ # pandas pd.DataFrame({ "vector": [[3.1, 4.1], [1, 1]], "item": ["foo", "bar"], "price": [10.0, 20.0], }), # pylist [ {"vector": [3.1, 4.1], "item": "foo", "price": 10.0}, {"vector": [5.9, 26.5], "item": "bar", "price": 20.0}, ], # recordbatch pa.RecordBatch.from_arrays( [ pa.array([[3.1, 4.1], [5.9, 26.5]], pa.list_(pa.float32(), 2)), pa.array(["foo", "bar"]), pa.array([10.0, 20.0]), ], ["vector", "item", "price"], ), # pydantic list [ Content(vector=[3.1, 4.1], item="foo", price=10.0), Content(vector=[5.9, 26.5], item="bar", price=20.0), ]] db = lancedb.connect("db") tbl = db.create_table("tabley", make_batches(), schema=Content, mode="overwrite") tbl.add(make_batches()) ``` The same should work with an arrow schema. --------- Co-authored-by: Weston Pace <weston.pace@gmail.com>	2023-08-18 09:56:30 +05:30
Chang She	e3061d4cb4	[python] Temporary restore feature (#428 ) This adds LanceTable.restore as a temporary feature. It reads data from a previous version and creates a new snapshot version using that data. This makes the version writeable unlike checkout. This should be replaced once the feature is implemented in pylance. Co-authored-by: Chang She <chang@lancedb.com>	2023-08-14 20:10:29 -07:00
Will Jones	722462c38b	chore: upgrade Lance and rename score to _distance (#398 ) BREAKING CHANGE: The `score` column has been renamed to `_distance` to more accurately describe the semantics (smaller means closer / better). --------- Co-authored-by: Lei Xu <lei@lancedb.com>	2023-08-11 21:42:33 -07:00
Ashis Kumar Naik	902a402951	implementation of drop_database (#418 ) Fixes #416. Added a `drop_database()` method, which deletes all the tables from the database with a single command. --------- Signed-off-by: Ashis Kumar Naik <ashishami2002@gmail.com>	2023-08-11 20:59:56 -07:00
Chang She	a54d1e5618	Automatically convert pydantic model (#400 ) Saves users from having to explicitly call `LanceModel.to_arrow_schema()` when creating an empty table. See new docs for full details. --------- Co-authored-by: Chang She <chang@lancedb.com>	2023-08-06 14:50:03 -07:00
Ayush Chaurasia	bbfadfe58d	[python] Allow adding via iterators (#391 ) Makes the following work so all the formats accepted by `create_table()` are also accepted by `add()` ``` import lancedb import pyarrow as pa db = lancedb.connect("/tmp") def make_batches(): for i in range(5): yield pa.RecordBatch.from_arrays( [ pa.array([[3.1, 4.1], [5.9, 26.5]]), pa.array(["foo", "bar"]), pa.array([10.0, 20.0]), ], ["vector", "item", "price"], ) schema = pa.schema([ pa.field("vector", pa.list_(pa.float32())), pa.field("item", pa.utf8()), pa.field("price", pa.float32()), ]) tbl = db.create_table("table4", make_batches(), schema=schema) tbl.add(make_batches()) ```	2023-08-04 12:49:44 -07:00
Chang She	cada35d5b7	Improve pydantic integration (#384 )	2023-07-31 12:16:44 -04:00
Chang She	2d25c263e9	Implement drop table if exists (#383 )	2023-07-31 10:25:09 +02:00
Lei Xu	63acdc2069	[Python] Support pydantic v1 as well (#337 ) Support both Pydantic v1 and v2 (breaking changes)	2023-07-18 19:53:09 -07:00
Lei Xu	f09db4a6d6	[Python] Do not return Table count for every add operation (#328 ) `Table::count()` will be linearly slower with more fragments ingested.	2023-07-18 17:11:17 -07:00
Lei Xu	088e745e1d	[Python] Create table with Iterator[RecordBatch] and add docs (#316 )	2023-07-16 21:45:55 -07:00
Lei Xu	028a6e433d	[Python] Get table schema (#313 )	2023-07-15 17:39:37 -07:00


90 Commits