lancedb

mirror of https://github.com/lancedb/lancedb.git synced 2026-07-03 11:00:40 +00:00

Author	SHA1	Message	Date
BubbleCal	4b79db72bf	docs: improve the docs and API param name (#1629 ) Signed-off-by: BubbleCal <bubble-cal@outlook.com>	2024-09-11 10:18:29 +08:00
Gagan Bhullar	205fc530cf	feat: expose hnsw indices (#1595 ) PR closes #1522 --------- Co-authored-by: Will Jones <willjones127@gmail.com>	2024-09-10 11:08:13 -07:00
BubbleCal	2bde5401eb	feat: support to build FTS without positions (#1621 )	2024-09-10 22:51:32 +08:00
Antonio Molner Domenech	a405847f9b	fix(python): remove unmaintained ratelimiter dependency (#1603 ) The `ratelimiter` package hasn't been updated in ages and is no longer maintained. This PR removes the dependency on `ratelimiter` and replaces it with a custom rate limiter implementation. --------- Co-authored-by: Will Jones <willjones127@gmail.com>	2024-09-09 12:35:53 -07:00
Will Jones	2a6586d6fb	feat: add flag to enable faster manifest paths (#1612 ) The new V2 manifest path scheme makes discovering the latest version of a table constant time on object stores, regardless of the number of versions in the table. See benchmarks in the PR here: https://github.com/lancedb/lance/pull/2798 Closes #1583	2024-09-09 11:34:36 -07:00
James Wu	029b01bbbf	feat: enable phrase_query(bool) for hybrid search queries (#1578 ) first off, apologies for any folly since i'm new to contributing to lancedb. this PR is the continuation of [a discord thread](https://discord.com/channels/1030247538198061086/1030247538667827251/1278844345713299599): ## user story here's the lance db search query i'd like to run: ``` def search(phrase): logger.info(f'Searching for phrase: {phrase}') phrase_embedding = get_embedding(phrase) df = (table.search((phrase_embedding, phrase), query_type='hybrid') .limit(10).to_list()) logger.info(f'Success search with row count: {len(df)}') search('howdy (howdy)') search('howdy(howdy)') ``` the second search fails due to `ValueError: Syntax Error: howdy(howdy)` i saw on the [docs](https://lancedb.github.io/lancedb/fts/#phrase-queries-vs-terms-queries) that i can use `phrase_query()` to [enable a flag](https://github.com/lancedb/lancedb/blob/main/python/python/lancedb/query.py#L790-L792) to wrap the query in double quotes (as well as sanitize single quotes) prior to sending the query to search. this works for [normal FTS](https://lancedb.github.io/lancedb/fts/), but the command is unavailable on [hybrid search](https://lancedb.github.io/lancedb/hybrid_search/hybrid_search/). ## changes i added `phrase_query()` function to `LanceHybridQueryBuilder` by propagating the call down to its `self._fts_query` object. i'm not too familiar with the codebase and am not sure if this is the best way to implement the functionality. feel free to riff on this PR or discard ## tests ``` (lancedb) JamesMPB:python james$ pwd /Users/james/src/lancedb/python (lancedb) JamesMPB:python james$ pytest python/tests/test_table.py python/tests/test_table.py ....................................... [100%] ====================================================== 39 passed, 1 warning in 2.23s ======================================================= ```	2024-09-07 08:58:05 +05:30
BubbleCal	8dcd328dce	feat: support to create table from record batch iterator (#1593 )	2024-09-06 10:41:38 +08:00
Gagan Bhullar	b24810a011	feat(python, rust): expose offset in query (#1556 ) PR is part of #1555	2024-09-05 08:33:07 -07:00
Ayush Chaurasia	03ef1dc081	feat: update default reranker to RRF (#1580 ) - Both LinearCombination (the current default) and RRF are pretty fast compared to model based rerankers. RRF is slightly faster. - In our tests RRF has also been slightly more accurate. This PR: - Makes RRF the default reranker - Removed duplicate docs for rerankers	2024-09-03 14:00:13 +05:30
Ayush Chaurasia	dc72ece847	feat!: better api for manual hybrid queries (#1575 ) Currently, the only documented way of performing hybrid search is by using the embedding API and passing string queries that get automatically embedded. There are use cases where users might like to pass vectors and text manually instead. This ticket contains more information and historical context - https://github.com/lancedb/lancedb/issues/937 This breaks an undocumented pathway that allowed passing (vector, text) tuple queries, which was intended to be temporary, so this is marked as a breaking change. For all practical purposes, this should not really impact most users. ### usage ``` results = table.search(query_type="hybrid") .vector(vector_query) .text(text_query) .limit(5) .to_pandas() ```	2024-08-30 17:37:58 +05:30
BubbleCal	1521435193	fix: specify column to search for FTS (#1572 ) Before this we ignored the `fts_columns` parameter, and for now we support to search on only one column, it could lead to an error if we have multiple indexed columns for FTS --------- Signed-off-by: BubbleCal <bubble-cal@outlook.com>	2024-08-29 23:43:46 +08:00
Gagan Bhullar	a85f039352	fix(bug): limit fix (#1548 ) PR fixes #1151	2024-08-26 14:25:14 -07:00
Ayush Chaurasia	549ca51a8a	feat: add answerdotai rerankers support and minor improvements (#1560 ) This PR: - Adds missing license headers - Integrates with answerdotai Rerankers package - Updates ColbertReranker to subclass answerdotai package. This is done to keep backwards compatibility as some users might be used to importing ColbertReranker directly - Set `trust_remote_code` to `True` by default in CrossEncoder and sentence-transformer based rerankers	2024-08-26 13:25:10 +05:30
Gagan Bhullar	6eb7ccfdee	fix: rerank attribute unknown (#1554 ) PR fixes #1550	2024-08-22 11:46:36 +05:30
Ayush Chaurasia	7d65dd97cf	chore(python): update Colbert architecture and minor improvements (#1547 ) - Update ColBertReranker architecture: The current implementation doesn't use the right arch. This PR uses the implementation in Rerankers library. Fixes https://github.com/lancedb/lancedb/issues/1546 Benchmark diff (hit rate): Hybrid - 91 vs 87 reranked vector - 85 vs 80 - Reranking in FTS is basically disabled in main after last week's FTS updates. I think there's no blocker in supporting that? - Allow overriding accelerators: Most transformer based Rerankers and Embedding automatically select device. This PR allows overriding those settings by passing `device`. Fixes: https://github.com/lancedb/lancedb/issues/1487 --------- Co-authored-by: BubbleCal <bubble-cal@outlook.com>	2024-08-21 12:26:52 +05:30
Lei Xu	5857cb4c6e	docs: add a section to describe scalar index (#1495 )	2024-08-16 18:48:29 -07:00
BubbleCal	0fa50775d6	feat: support to query/index FTS on RemoteTable/AsyncTable (#1537 ) Signed-off-by: BubbleCal <bubble-cal@outlook.com>	2024-08-16 12:01:05 +08:00
Gagan Bhullar	20faa4424b	feat(python): add delete unverified parameter (#1542 ) PR fixes #1527	2024-08-15 09:01:32 -07:00
BubbleCal	b624fc59eb	docs: add `create_fts_index` doc in Python API Reference (#1533 ) resolve #1313 --------- Signed-off-by: BubbleCal <bubble-cal@outlook.com>	2024-08-15 11:35:16 +08:00
Ryan Green	b3daa25f46	feat: allow new scalar index types to be created in remote table (#1538 )	2024-08-13 16:05:42 -02:30
Lei Xu	b2317c904d	feat: create bitmap and label list scalar index using python async api (#1529 ) * Expose `bitmap` and `LabelList` scalar index type via Rust and Async Python API * Add documents	2024-08-11 09:16:11 -07:00
Gagan Bhullar	9c1adff426	feat(python): add to_list to async api (#1520 ) PR fixes #1517	2024-08-08 11:45:20 -07:00
BubbleCal	f9d5fa88a1	feat!: migrate FTS from tantivy to lance-index (#1483 ) Lance now supports FTS, so add it into lancedb Python, TypeScript and Rust SDKs. For Python, we still use tantivy based FTS by default because the lance FTS index currently lacks some features of tantivy. For Python: - Support to create lance based FTS index - Support to specify columns for full text search (only available for lance based FTS index) For TypeScript: - Change the search method so that it can accept both string and vector - Support full text search For Rust - Support full text search The others: - Update the FTS doc BREAKING CHANGE: - for Python, this renames the attached score column of FTS from "score" to "_score", this could be a breaking change for users that rely on the scores --------- Signed-off-by: BubbleCal <bubble-cal@outlook.com>	2024-08-08 15:33:15 +08:00
Lei Xu	2bdf0a02f9	feat!: upgrade lance to 0.16 (#1519 )	2024-08-07 13:15:22 -07:00
Gagan Bhullar	32123713fd	feat(python): optimize stats repr method (#1510 ) PR fixes #1507	2024-08-07 08:47:52 -07:00
Gagan Bhullar	d5a01ffe7b	feat(python): index config repr method (#1509 ) PR fixes #1506	2024-08-07 08:46:46 -07:00
Ayush Chaurasia	e01045692c	feat(python): support embedding functions in remote table (#1405 )	2024-08-07 20:22:43 +05:30
Ayush Chaurasia	4769d8eb76	feat(python): multi-vector reranking support (#1481 ) Currently targeting the following usage: ``` from lancedb.rerankers import CrossEncoderReranker reranker = CrossEncoderReranker() query = "hello" res1 = table.search(query, vector_column_name="vector").limit(3) res2 = table.search(query, vector_column_name="text_vector").limit(3) res3 = table.search(query, vector_column_name="meta_vector").limit(3) reranked = reranker.rerank_multivector( [res1, res2, res3], deduplicate=True, query=query # some reranker models need query ) ``` - This implements rerank_multivector function in the base reranker so that all rerankers that implement rerank_vector will automatically have multivector reranking support - Special case for RRF reranker that just uses its existing rerank_hybrid fcn to multi-vector reranking. --------- Co-authored-by: Weston Pace <weston.pace@gmail.com>	2024-08-07 01:45:46 +05:30
Robby	8d2ff7b210	feat(python): add watsonx embeddings to registry (#1486 ) Related issue: https://github.com/lancedb/lancedb/issues/1412 --------- Co-authored-by: Robby <h0rv@users.noreply.github.com>	2024-08-06 10:58:33 +05:30
Ryan Green	6af69b57ad	fix: return LanceMergeInsertBuilder in overridden merge_insert method on remote table (#1484 )	2024-07-31 12:25:16 -02:30
Will Jones	9555efacf9	feat: upgrade lance to 0.15.0 (#1477 ) Changelog: https://github.com/lancedb/lance/releases/tag/v0.15.0 * Fixes #1466 * Closes #1475 * Fixes #1446	2024-07-26 09:13:49 -07:00
Chang She	374c1e7aba	fix: infer schema from huggingface dataset (#1444 ) Closes #1383 When creating a table from a HuggingFace dataset, infer the arrow schema directly	2024-07-23 13:12:34 -07:00
Ayush Chaurasia	0255221086	feat: add reciprocal rank fusion reranker (#1456 ) Implements https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf Refactors the hybrid search only rerankers test to avoid repetition.	2024-07-23 21:37:17 +05:30
Weston Pace	d4aad82aec	fix: don't use v2 by default on empty table (#1469 )	2024-07-23 06:47:49 -07:00
Will Jones	4f601a2d4c	fix: handle camelCase column names in select (#1460 ) Fixes #1385	2024-07-22 12:53:17 -07:00
Lei Xu	c9c61eb060	docs: expose merge_insert doc for remote python SDK (#1464 ) `merge_insert` API is not shown up on [`RemoteTable`](https://lancedb.github.io/lancedb/python/saas-python/#lancedb.remote.table.RemoteTable) today * Also bump `ruff` version as well	2024-07-22 10:48:16 -07:00
Ayush Chaurasia	ed7bd45c17	chore: choose appropriate args for concat_table based on pyarrow version & refactor reranker tests (#1455 )	2024-07-18 21:04:59 +05:30
Magnus	dc609a337d	fix: added support for trust_remote_code (#1454 ) Closes #1285 Added trust_remote_code to the SentenceTransformerEmbeddings class. Defaults to `False`	2024-07-18 19:37:52 +05:30
Adam Azzam	82621d5b13	chore: typing for lance.connect (#1441 ) Feel free to close if this is a distraction, but untyped keywords in lance.connect is throwing pylance errors in strict mode. <img width="683" alt="Screenshot 2024-07-11 at 1 21 04 PM" src="https://github.com/lancedb/lancedb/assets/33043305/fe6cd4d9-4e59-413d-87f2-aabb9ff84cc4">	2024-07-12 10:39:28 -07:00
Lei Xu	0708428357	feat: support update over binary field (#1440 )	2024-07-12 09:22:00 -07:00
Joan Fontanals	cef24801f4	docs: add jina reranker to index (#1427 ) PR to add JinaReranker documentation page to the rerankers index	2024-07-09 14:39:35 +05:30
forrestmckee	b4436e0804	refactor: update type hint and remove unused import (#1436 ) change typehint on `_invert_score` from `List[float]` to `float`. remove unnecessary typing import	2024-07-09 13:56:45 +05:30
Lei Xu	fd5ca20f34	chore: bump lance to 0.14 (#1430 )	2024-07-06 14:10:42 -07:00
Joan Fontanals	08d25c5a80	feat: add Jina integration in Python for Embedding and Reranker (#1424 ) Integration of Jina Embeddings and Rerankers through its API	2024-07-05 01:34:43 +05:30
Nuvic	46c6ff889d	feat: add the explain_plan function (#1328 ) It's useful to see the underlying query plan for debugging purposes. This exposes LanceScanner's `explain_plan` function. Addresses #1288 --------- Co-authored-by: Will Jones <willjones127@gmail.com>	2024-07-02 11:10:01 -07:00
BubbleCal	12b3c87964	feat: support to create more vector index types (#1407 ) Signed-off-by: BubbleCal <bubble-cal@outlook.com>	2024-07-02 10:53:03 -02:30
Will Jones	865ed99881	feat: dynamodb commit store support (#1410 ) This allows users to specify URIs like: ``` s3+ddb://my_bucket/path?ddbTableName=myCommitTable ``` and it will support concurrent writes in S3. * [x] Add dynamodb integration tests * [x] Add modifications to get it working in Python sync API * [x] Added section in documentation describing how to configure. Closes #534 --------- Co-authored-by: universalmind303 <cory.grinstead@gmail.com>	2024-06-28 09:30:36 -07:00
josca42	0fe844034d	feat: enable stemming (#1356 ) Added the ability to specify tokenizer_name, when creating a full text search index using tantivy. This enables the use of language specific stemming. Also updated the [guide on full text search](https://lancedb.github.io/lancedb/fts/) with a short section on choosing tokenizer. Fixes #1315	2024-06-20 14:23:55 -07:00
harsha-mangena	a45656b8b6	docs: remove code-block:: python from docs (#1366 ) - refer #1264 - fixed minor documentation issue	2024-06-11 13:13:02 -07:00
Ayush Chaurasia	76fc16c7a1	docs: add retriever guide, address minor onboarding feedbacks & enhancement (#1326 ) - Tried to address some onboarding feedbacks listed in https://github.com/lancedb/lancedb/issues/1224 - Improve visibility of pydantic integration and embedding API. (Based on onboarding feedback - Many ways of ingesting data, defining schema but not sure what to use in a specific use-case) - Add a guide that takes users through testing and improving retriever performance using built-in utilities like hybrid-search and reranking - Add some benchmarks for the above - Add missing cohere docs --------- Co-authored-by: Weston Pace <weston.pace@gmail.com>	2024-06-08 06:25:31 +05:30


255 Commits