lancedb

mirror of https://github.com/lancedb/lancedb.git synced 2025-12-24 22:09:58 +00:00

Author	SHA1	Message	Date
Ayush Chaurasia	7d65dd97cf	chore(python): update Colbert architecture and minor improvements (#1547 ) - Update ColBertReranker architecture: The current implementation doesn't use the right arch. This PR uses the implementation in Rerankers library. Fixes https://github.com/lancedb/lancedb/issues/1546 Benchmark diff (hit rate): Hybrid - 91 vs 87 reranked vector - 85 vs 80 - Reranking in FTS is basically disabled in main after last week's FTS updates. I think there's no blocker in supporting that? - Allow overriding accelerators: Most transformer based Rerankers and Embedding automatically select device. This PR allows overriding those settings by passing `device`. Fixes: https://github.com/lancedb/lancedb/issues/1487 --------- Co-authored-by: BubbleCal <bubble-cal@outlook.com>	2024-08-21 12:26:52 +05:30
BubbleCal	0fa50775d6	feat: support to query/index FTS on RemoteTable/AsyncTable (#1537 ) Signed-off-by: BubbleCal <bubble-cal@outlook.com>	2024-08-16 12:01:05 +08:00
Gagan Bhullar	20faa4424b	feat(python): add delete unverified parameter (#1542 ) PR fixes #1527	2024-08-15 09:01:32 -07:00
Lei Xu	b2317c904d	feat: create bitmap and label list scalar index using python async api (#1529 ) * Expose `bitmap` and `LabelList` scalar index type via Rust and Async Python API * Add documents	2024-08-11 09:16:11 -07:00
Gagan Bhullar	9c1adff426	feat(python): add to_list to async api (#1520 ) PR fixes #1517	2024-08-08 11:45:20 -07:00
BubbleCal	f9d5fa88a1	feat!: migrate FTS from tantivy to lance-index (#1483 ) Lance now supports FTS, so add it into lancedb Python, TypeScript and Rust SDKs. For Python, we still use tantivy based FTS by default because the lance FTS index currently lacks some features of tantivy. For Python: - Support to create lance based FTS index - Support to specify columns for full text search (only available for lance based FTS index) For TypeScript: - Change the search method so that it can accept both string and vector - Support full text search For Rust - Support full text search The others: - Update the FTS doc BREAKING CHANGE: - for Python, this renames the attached score column of FTS from "score" to "_score", this could be a breaking change for users that rely on the scores --------- Signed-off-by: BubbleCal <bubble-cal@outlook.com>	2024-08-08 15:33:15 +08:00
Lei Xu	2bdf0a02f9	feat!: upgrade lance to 0.16 (#1519 )	2024-08-07 13:15:22 -07:00
Gagan Bhullar	32123713fd	feat(python): optimize stats repr method (#1510 ) PR fixes #1507	2024-08-07 08:47:52 -07:00
Gagan Bhullar	d5a01ffe7b	feat(python): index config repr method (#1509 ) PR fixes #1506	2024-08-07 08:46:46 -07:00
Ayush Chaurasia	4769d8eb76	feat(python): multi-vector reranking support (#1481 ) Currently targeting the following usage: ``` from lancedb.rerankers import CrossEncoderReranker reranker = CrossEncoderReranker() query = "hello" res1 = table.search(query, vector_column_name="vector").limit(3) res2 = table.search(query, vector_column_name="text_vector").limit(3) res3 = table.search(query, vector_column_name="meta_vector").limit(3) reranked = reranker.rerank_multivector( [res1, res2, res3], deduplicate=True, query=query # some reranker models need query ) ``` - This implements rerank_multivector function in the base reranker so that all rerankers that implement rerank_vector will automatically have multivector reranking support - Special case for RRF reranker that just uses its existing rerank_hybrid fcn to multi-vector reranking. --------- Co-authored-by: Weston Pace <weston.pace@gmail.com>	2024-08-07 01:45:46 +05:30
Robby	8d2ff7b210	feat(python): add watsonx embeddings to registry (#1486 ) Related issue: https://github.com/lancedb/lancedb/issues/1412 --------- Co-authored-by: Robby <h0rv@users.noreply.github.com>	2024-08-06 10:58:33 +05:30
Chang She	374c1e7aba	fix: infer schema from huggingface dataset (#1444 ) Closes #1383 When creating a table from a HuggingFace dataset, infer the arrow schema directly	2024-07-23 13:12:34 -07:00
Ayush Chaurasia	0255221086	feat: add reciprocal rank fusion reranker (#1456 ) Implements https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf Refactors the hybrid search only rerankers test to avoid repetition.	2024-07-23 21:37:17 +05:30
Weston Pace	d4aad82aec	fix: don't use v2 by default on empty table (#1469 )	2024-07-23 06:47:49 -07:00
Will Jones	4f601a2d4c	fix: handle camelCase column names in select (#1460 ) Fixes #1385	2024-07-22 12:53:17 -07:00
Ayush Chaurasia	ed7bd45c17	chore: choose appropriate args for concat_table based on pyarrow version & refactor reranker tests (#1455 )	2024-07-18 21:04:59 +05:30
Lei Xu	0708428357	feat: support update over binary field (#1440 )	2024-07-12 09:22:00 -07:00
Nuvic	46c6ff889d	feat: add the explain_plan function (#1328 ) It's useful to see the underlying query plan for debugging purposes. This exposes LanceScanner's `explain_plan` function. Addresses #1288 --------- Co-authored-by: Will Jones <willjones127@gmail.com>	2024-07-02 11:10:01 -07:00
Will Jones	865ed99881	feat: dynamodb commit store support (#1410 ) This allows users to specify URIs like: ``` s3+ddb://my_bucket/path?ddbTableName=myCommitTable ``` and it will support concurrent writes in S3. * [x] Add dynamodb integration tests * [x] Add modifications to get it working in Python sync API * [x] Added section in documentation describing how to configure. Closes #534 --------- Co-authored-by: universalmind303 <cory.grinstead@gmail.com>	2024-06-28 09:30:36 -07:00
josca42	0fe844034d	feat: enable stemming (#1356 ) Added the ability to specify tokenizer_name, when creating a full text search index using tantivy. This enables the use of language specific stemming. Also updated the [guide on full text search](https://lancedb.github.io/lancedb/fts/) with a short section on choosing tokenizer. Fixes #1315	2024-06-20 14:23:55 -07:00
Ayush Chaurasia	76fc16c7a1	docs: add retriever guide, address minor onboarding feedbacks & enhancement (#1326 ) - Tried to address some onboarding feedbacks listed in https://github.com/lancedb/lancedb/issues/1224 - Improve visibility of pydantic integration and embedding API. (Based on onboarding feedback - Many ways of ingesting data, defining schema but not sure what to use in a specific use-case) - Add a guide that takes users through testing and improving retriever performance using built-in utilities like hybrid-search and reranking - Add some benchmarks for the above - Add missing cohere docs --------- Co-authored-by: Weston Pace <weston.pace@gmail.com>	2024-06-08 06:25:31 +05:30
Weston Pace	d5586c9c32	feat: make it possible to opt in to using the v2 format (#1352 ) This also exposed the max_batch_length configuration option in python/node (it was needed to verify if we are actually in v2 mode or not)	2024-06-04 21:52:14 -07:00
zhongpu	3bb7c546d7	fix: the bug of async connection context manager (#1333 ) - add `return` for `__enter__` The buggy code didn't return the object, therefore it will always return None within a context manager: ```python with await lancedb.connect_async("./.lancedb") as db: # db is always None ``` (BTW, why not to design an async context manager?) - add a unit test for Async connection context manager - update return type of `AsyncConnection.open_table` to `AsyncTable` Although type annotation doesn't affect the functionality, it is helpful for IDEs.	2024-05-29 09:33:32 -07:00
Philip Meier	1ad1c0820d	chore: replace semver dependency with packaging (#1311 ) Fixes #1296 per title. See https://github.com/lancedb/lancedb/pull/1298#discussion_r1603931457 Cc @wjones127 --------- Co-authored-by: Will Jones <willjones127@gmail.com>	2024-05-28 10:05:16 -07:00
Weston Pace	4f512af024	feat: add the optimize function to nodejs and async python (#1257 ) The optimize function is pretty crucial for getting good performance when building a large scale dataset but it was only exposed in rust (many sync python users are probably doing this via to_lance today) This PR adds the optimize function to nodejs and to python. I left the function marked experimental because I think there will likely be changes to optimization (e.g. if we add features like "optimize on write"). I also only exposed the `cleanup_older_than` configuration parameter since this one is very commonly used and the rest have sensible defaults and we don't really know why we would recommend different values for these defaults anyways.	2024-05-20 07:09:31 -07:00
asmith26	3850d5fb35	Add ollama embeddings function (#1263 ) Following the docs [here](https://lancedb.github.io/lancedb/python/python/#lancedb.embeddings.openai.OpenAIEmbeddings) I've been trying to use ollama embedding via the OpenAI API interface, but unfortunately I couldn't get it to work (possibly related to https://github.com/ollama/ollama/issues/2416) Given the popularity of ollama I thought it could be helpful to have a dedicated Ollama Embedding function in lancedb. Very much welcome any thought on this or my code etc. Thanks!	2024-05-13 13:09:19 +05:30
Will Jones	12dbca5248	ci: better test for test_syntax (#1278 ) The syntax error was fixed in tantivy 0.22.0, so I changed the test case to something more wrong.	2024-05-07 11:52:39 -07:00
Weston Pace	3d7c48feca	feat: allow the index_cache_size to be configured when opening a table (#1245 ) This was already configurable in the rust API but it wasn't actually being passed down to the underlying dataset. I added this option to both the async python API and the new nodejs API. I also added this option to the synchronous python API. I did not add the option to vectordb.	2024-04-26 13:42:02 -07:00
Raghav Dixit	a6aa67baed	python: Bug fixes / tests (#1210 ) closes #1194 #1172 #1124 #1208 @wjones127 : `if query_type != "fts":` is needed because both fts and vector search create `LanceQueryBuilder` which has `vector_column_name` as a required attribute.	2024-04-10 10:17:14 -07:00
Will Jones	1d23af213b	feat: expose storage options in LanceDB (#1204 ) Exposes `storage_options` in LanceDB. This is provided for Python async, Node `lancedb`, and Node `vectordb` (and Rust of course). Python synchronous is omitted because it's not compatible with the PyArrow filesystems we use there currently. In the future, we will move the sync API to wrap the async one, and then it will get support for `storage_options`. 1. Fixes #1168 2. Closes #1165 3. Closes #1082 4. Closes #439 5. Closes #897 6. Closes #642 7. Closes #281 8. Closes #114 9. Closes #990 10. Deprecating `awsCredentials` and `awsRegion`. Users are encouraged to use `storageOptions` instead.	2024-04-10 10:12:04 -07:00
Raghav Dixit	1c41a00d87	Embeddings: HF model hub support added via transformers (#1154 )	2024-04-05 16:34:56 -07:00
Weston Pace	f97c7dad8c	docs: add the async python API to the docs (#1156 )	2024-04-05 16:34:37 -07:00
Lei Xu	473ef7e426	chore: validate table name (#1146 ) Closes #1129	2024-04-05 16:33:37 -07:00
Ishani Ghose	0838e12b30	feat: add to_batches API #805 (#1048 ) SDK Python Description Exposes pyarrow batch api during query execution - relevant when there is no vector search query, dataset is large and the filtered result is larger than memory. --------- Co-authored-by: Ishani Ghose <isghose@amazon.com> Co-authored-by: Chang She <759245+changhiskhan@users.noreply.github.com>	2024-04-05 16:33:37 -07:00
natcharacter	f6e9f8e3f4	Order by field support FTS (#1132 ) This PR adds support for passing through a set of ordering fields at index time (unsigned ints that tantivy can use as fast_fields) that at query time you can sort your results on. This is useful for cases where you want to get related hits, i.e. by keyword, but order those hits by some other score, such as popularity. For example, search for song descriptions that match "sad AND jazz AND 1920" and then order those by number of times played. Example usage can be seen in the fts tests. --------- Co-authored-by: Nat Roth <natroth@Nats-MacBook-Pro.local> Co-authored-by: Chang She <759245+changhiskhan@users.noreply.github.com>	2024-04-05 16:33:36 -07:00
Chang She	4466cfa958	feat(python): support writing huggingface dataset and dataset dict (#1110 ) HuggingFace Dataset is written as arrow batches. For DatasetDict, all splits are written with a "split" column appended. - [x] what if the dataset schema already has a `split` column - [x] add unit tests	2024-04-05 16:33:06 -07:00
Ayush Chaurasia	42fad84ec8	feat(python): Support reranking for vector and fts (#1103 ) solves https://github.com/lancedb/lancedb/issues/1086 Usage Reranking with FTS: ``` retriever = db.create_table("fine-tuning", schema=Schema, mode="overwrite") pylist = [{"text": "Carson City is the capital city of the American state of Nevada. At the 2010 United States Census, Carson City had a population of 55,274."}, {"text": "The Commonwealth of the Northern Mariana Islands is a group of islands in the Pacific Ocean that are a political division controlled by the United States. Its capital is Saipan."}, {"text": "Charlotte Amalie is the capital and largest city of the United States Virgin Islands. It has about 20,000 people. The city is on the island of Saint Thomas."}, {"text": "Washington, D.C. (also known as simply Washington or D.C., and officially as the District of Columbia) is the capital of the United States. It is a federal district. "}, {"text": "Capital punishment (the death penalty) has existed in the United States since before the United States was a country. As of 2017, capital punishment is legal in 30 of the 50 states."}, {"text": "North Dakota is a state in the United States. 672,591 people lived in North Dakota in the year 2010. The capital and seat of government is Bismarck."}, ] retriever.add(pylist) retriever.create_fts_index("text", replace=True) query = "What is the capital of the United States?" reranker = CohereReranker(return_score="all") print(retriever.search(query, query_type="fts").limit(10).to_pandas()) print(retriever.search(query, query_type="fts").rerank(reranker=reranker).limit(10).to_pandas()) ``` Result ``` text vector score 0 Capital punishment (the death penalty) has exi... [0.099975586, 0.047943115, -0.16723633, -0.183... 0.729602 1 Charlotte Amalie is the capital and largest ci... [-0.021255493, 0.03363037, -0.027450562, -0.17... 0.678046 2 The Commonwealth of the Northern Mariana Islan... 
[0.3684082, 0.30493164, 0.004600525, -0.049407... 0.671521 3 Carson City is the capital city of the America... [0.13989258, 0.14990234, 0.14172363, 0.0546569... 0.667898 4 Washington, D.C. (also known as simply Washing... [-0.0090408325, 0.42578125, 0.3798828, -0.3574... 0.653422 5 North Dakota is a state in the United States. ... [0.55859375, -0.2109375, 0.14526367, 0.1634521... 0.639346 text vector score _relevance_score 0 Washington, D.C. (also known as simply Washing... [-0.0090408325, 0.42578125, 0.3798828, -0.3574... 0.653422 0.979977 1 The Commonwealth of the Northern Mariana Islan... [0.3684082, 0.30493164, 0.004600525, -0.049407... 0.671521 0.299105 2 Capital punishment (the death penalty) has exi... [0.099975586, 0.047943115, -0.16723633, -0.183... 0.729602 0.284874 3 Carson City is the capital city of the America... [0.13989258, 0.14990234, 0.14172363, 0.0546569... 0.667898 0.089614 4 North Dakota is a state in the United States. ... [0.55859375, -0.2109375, 0.14526367, 0.1634521... 0.639346 0.063832 5 Charlotte Amalie is the capital and largest ci... [-0.021255493, 0.03363037, -0.027450562, -0.17... 0.678046 0.041462 ``` ## Vector Search usage: ``` query = "What is the capital of the United States?" reranker = CohereReranker(return_score="all") print(retriever.search(query).limit(10).to_pandas()) print(retriever.search(query).rerank(reranker=reranker, query=query).limit(10).to_pandas()) # <-- Note: passing extra string query here ``` Results ``` text vector _distance 0 Capital punishment (the death penalty) has exi... [0.099975586, 0.047943115, -0.16723633, -0.183... 39.728973 1 Washington, D.C. (also known as simply Washing... [-0.0090408325, 0.42578125, 0.3798828, -0.3574... 41.384884 2 Carson City is the capital city of the America... [0.13989258, 0.14990234, 0.14172363, 0.0546569... 55.220200 3 Charlotte Amalie is the capital and largest ci... [-0.021255493, 0.03363037, -0.027450562, -0.17... 
58.345654 4 The Commonwealth of the Northern Mariana Islan... [0.3684082, 0.30493164, 0.004600525, -0.049407... 60.060867 5 North Dakota is a state in the United States. ... [0.55859375, -0.2109375, 0.14526367, 0.1634521... 64.260544 text vector _distance _relevance_score 0 Washington, D.C. (also known as simply Washing... [-0.0090408325, 0.42578125, 0.3798828, -0.3574... 41.384884 0.979977 1 The Commonwealth of the Northern Mariana Islan... [0.3684082, 0.30493164, 0.004600525, -0.049407... 60.060867 0.299105 2 Capital punishment (the death penalty) has exi... [0.099975586, 0.047943115, -0.16723633, -0.183... 39.728973 0.284874 3 Carson City is the capital city of the America... [0.13989258, 0.14990234, 0.14172363, 0.0546569... 55.220200 0.089614 4 North Dakota is a state in the United States. ... [0.55859375, -0.2109375, 0.14526367, 0.1634521... 64.260544 0.063832 5 Charlotte Amalie is the capital and largest ci... [-0.021255493, 0.03363037, -0.027450562, -0.17... 58.345654 0.041462 ```	2024-04-05 16:33:06 -07:00
Weston Pace	4180b44472	feat: refactor the query API and add query support to the python async API (#1113 ) In addition, there are also a number of changes in nodejs to the docstrings of existing methods because this PR adds a jsdoc linter.	2024-04-05 16:32:47 -07:00
Christian Di Lorenzo	8bb983bc3d	fix(python): Add python azure blob read support (#1102 ) I know there's a larger effort to have the python client based on the core rust implementation, but in the meantime there have been several issues (#1072 and #485) with some of the azure blob storage calls due to pyarrow not natively supporting an azure backend. To this end, I've added an optional import of the fsspec implementation of azure blob storage [`adlfs`](https://pypi.org/project/adlfs/) and passed it to `pyarrow.fs`. I've modified the existing test and manually verified it with some real credentials to make sure it behaves as expected. It should be now as simple as: ```python import lancedb db = lancedb.connect("az://blob_name/path") table = db.open_table("test") table.search(...) ``` Thank you for this cool project and we're excited to start using this for real shortly! 🎉 And thanks to @dwhitena for bringing it to my attention with his prediction guard posts. Co-authored-by: christiandilorenzo <christian.dilorenzo@infiniaml.com>	2024-04-05 16:32:31 -07:00
Chang She	377832e532	feat(python): support optional vector field in pydantic model (#1097 ) The LanceDB embeddings registry allows users to annotate the pydantic model used as table schema with the desired embedding function, e.g.: ```python class Schema(LanceModel): id: str vector: Vector(openai.ndims()) = openai.VectorField() text: str = openai.SourceField() ``` Tables created like this do not require embeddings to be calculated by the user explicitly, e.g. this works: ```python table.add([{"id": "foo", "text": "rust all the things"}]) ``` However, trying to construct pydantic model instances without vector doesn't because it's a required field. Instead, you need to add a default value: ```python class Schema(LanceModel): id: str vector: Vector(openai.ndims()) = openai.VectorField(default=None) text: str = openai.SourceField() ``` then this completes without errors: ```python table.add([Schema(id="foo", text="rust all the things")]) ``` However, all of the vectors are filled with zeros. Instead in add_vector_col we have to add an additional check so that the embedding generation is called.	2024-04-05 16:32:15 -07:00
Weston Pace	b6a522d483	feat: add list_indices to the async api (#1074 )	2024-04-05 16:32:15 -07:00
Weston Pace	9031ec6878	feat: add update to the async API (#1093 )	2024-04-05 16:32:15 -07:00
Weston Pace	47daf9b7b0	feat: add time travel operations to the async API (#1070 )	2024-04-05 16:32:15 -07:00
Weston Pace	f822255683	feat: add create_index to the async python API (#1052 ) This also refactors the rust lancedb index builder API (and, correspondingly, the nodejs API)	2024-04-05 16:32:14 -07:00
Weston Pace	73c69a6b9a	feat: page_token / limit to native table_names function. Use async table_names function from sync table_names function (#1059 ) The synchronous table_names function in python lancedb relies on arrow's filesystem which behaves slightly differently than object_store. As a result, the function would not work properly in GCS. However, the async table_names function uses object_store directly and thus is accurate. In most cases we can fallback to using the async table_names function and so this PR does so. The one case we cannot is if the user is already in an async context (we can't start a new async event loop). Soon, we can just redirect those users to use the async API instead of the sync API and so that case will eventually go away. For now, we fallback to the old behavior.	2024-04-05 16:31:45 -07:00
Chang She	10089481c0	doc(python): document the method in fts (#982 ) Co-authored-by: prrao87 <prrao87@gmail.com> Co-authored-by: Prashanth Rao <35005448+prrao87@users.noreply.github.com>	2024-04-05 16:31:45 -07:00
Ayush Chaurasia	b5326d31e9	fix(python): Few fts patches (#1039 ) 1. filtering with fts mutated the schema, which caused schema mismatch problems with hybrid search as it combines fts and vector search tables. 2. fts with filter failed with `with_row_id`. This was because row_id was calculated before filtering which caused size mismatch on attaching it after. 3. The fix for 1 meant that now row_id is attached before filtering but passing a filter to `to_lance` on a dataset that already contains `_rowid` raises a panic from lance. So temporarily, in case where fts is used with a filter AND `with_row_id`, we just force the user to use the duckdb pathway. --------- Co-authored-by: Chang She <759245+changhiskhan@users.noreply.github.com>	2024-04-05 16:31:45 -07:00
Weston Pace	8033a44d68	feat: add support for add to async python API (#1037 ) In order to add support for `add` we needed to migrate the rust `Table` trait to a `Table` struct and `TableInternal` trait (similar to the way the connection is designed). While doing this we also cleaned up some inconsistencies between the SDKs: * Python and Node are garbage collected languages and it can be difficult to trigger something to be freed. The convention for these languages is to have some kind of close method. I added a close method to both the table and connection which will drop the underlying rust object. * We made significant improvements to table creation in `cc5f2136a6` for the `node` SDK. I copied these changes to the `nodejs` SDK. * The nodejs tables were using fs to create tmp directories and these were not getting cleaned up. This is mostly harmless but annoying and so I changed it up a bit to ensure we cleanup tmp directories. * ~~countRows in the node SDK was returning `bigint`. I changed it to return `number`~~ (this actually happened in a previous PR) * Tables and connections now implement `std::fmt::Display` which is hooked into python's `__repr__`. Node has no concept of a regular "to string" function and so I added a `display` method. * Python method signatures are changing so that optional parameters are always `Optional[foo] = None` instead of something like `foo = False`. This is because we want those defaults to be in rust whenever possible (though we still need to mention the default in documentation). * I changed the python `AsyncConnection/AsyncTable` classes from abstract classes with a single implementation to just classes because we no longer have the remote implementation in python. Note: this does NOT add the `add` function to the remote table. 
This PR was already large enough, and the remote implementation is unique enough, that I am going to do all the remote stuff at a later date (we should have the structure in place and correct so there shouldn't be any refactor concerns) --------- Co-authored-by: Will Jones <willjones127@gmail.com>	2024-04-05 16:31:36 -07:00
Weston Pace	4299f719ec	feat: port create_table to the async python API and the remote rust API (#1031 ) I've also started `ASYNC_MIGRATION.MD` to keep track of the breaking changes from sync to async python.	2024-04-05 16:31:36 -07:00
Rob Meng	b8eb5d4bfe	fix: fix columns type for pydantic 2.x (#1045 )	2024-04-05 16:31:36 -07:00

52 Commits
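
Commit 0255221086 above adds a reciprocal rank fusion (RRF) reranker. As a rough illustration of the idea in the linked paper, here is a minimal pure-Python sketch: each document's fused score is the sum of 1 / (k + rank) over every ranked list it appears in. This is not LanceDB's implementation; the function name `rrf_fuse`, the sample ids, and the default `k=60` (the constant used in the paper) are assumptions made for the example.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of document ids with reciprocal rank fusion.

    `rankings` is a list of lists, each ordered best-first.
    `rrf_fuse` is a hypothetical helper, not part of the lancedb package.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the top hit contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical result ids from a vector search and a full-text search.
vector_hits = ["a", "b", "c"]
fts_hits = ["b", "d", "a"]

fused = rrf_fuse([vector_hits, fts_hits])
for doc_id, score in fused:
    print(doc_id, round(score, 6))
```

Documents that appear near the top of several lists accumulate the largest scores, so RRF needs only ranks and never has to compare raw scores from different retrievers.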