Adds capability to the remote Python SDK to retry requests (fixes #911).
This can be configured through environment variables:
- `LANCE_CLIENT_MAX_RETRIES` = total number of retries. Set to 0 to
disable retries. Default = 3.
- `LANCE_CLIENT_CONNECT_RETRIES` = number of times to retry a request in
case of a TCP connect failure. Default = 3.
- `LANCE_CLIENT_READ_RETRIES` = number of times to retry a request in case
of an HTTP request failure. Default = 3.
- `LANCE_CLIENT_RETRY_STATUSES` = HTTP statuses for which the request
will be retried, passed as a comma-separated list of ints. Default = `500,
502, 503`.
- `LANCE_CLIENT_RETRY_BACKOFF_FACTOR` = controls the time between retry
requests; see
[here](23f2287eb5/src/urllib3/util/retry.py (L141-L146)).
Default = 0.25.
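For example, these could be set from Python before connecting (a sketch; the values are illustrative):
```py
import os

# Illustrative values: retry up to 5 times, only on 503s, with a
# longer backoff between attempts.
os.environ["LANCE_CLIENT_MAX_RETRIES"] = "5"
os.environ["LANCE_CLIENT_RETRY_STATUSES"] = "503"
os.environ["LANCE_CLIENT_RETRY_BACKOFF_FACTOR"] = "1.0"
```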
Only read requests will be retried:
- list table names
- query
- describe table
- list table indices
This does not add retry capabilities for writes, as that could cause
issues when the retried write isn't idempotent. For example, if the
load balancer times out the request but the server completes it anyway,
we might not want to blindly retry an insert request.
Based on https://github.com/lancedb/lancedb/pull/713
- The Reranker API can be plugged into vector-only or FTS-only search,
but this PR doesn't do that (see example:
https://txt.cohere.com/rerank/)
### Default reranker -- `LinearCombinationReranker(weight=0.7,
fill=1.0)`
```py
table.search("hello", query_type="hybrid").rerank(normalize="score").to_pandas()
```
### Available rerankers
LinearCombinationReranker
```py
from lancedb.rerankers import LinearCombinationReranker

# Same as default
table.search("hello", query_type="hybrid").rerank(
    normalize="score",
    reranker=LinearCombinationReranker()
).to_pandas()

# With custom params
reranker = LinearCombinationReranker(weight=0.3, fill=1.0)
table.search("hello", query_type="hybrid").rerank(
    normalize="score",
    reranker=reranker
).to_pandas()
```
Cohere Reranker
```py
from lancedb.rerankers import CohereReranker

# Default model; English and multilingual are supported.
# See the docstring for available custom params.
table.search("hello", query_type="hybrid").rerank(
    normalize="rank",  # "score" or "rank"
    reranker=CohereReranker()
).to_pandas()
```
CrossEncoderReranker
```py
from lancedb.rerankers import CrossEncoderReranker

table.search("hello", query_type="hybrid").rerank(
    normalize="rank",
    reranker=CrossEncoderReranker()
).to_pandas()
```
## Using a custom Reranker
```py
from lancedb.rerankers import Reranker

class CustomReranker(Reranker):
    def rerank_hybrid(self, vector_results, fts_results):
        # Merge the two result sets (or use custom combination logic)
        combined_res = self.merge_results(vector_results, fts_results)
        # Custom rerank logic here
        return combined_res
```
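A custom reranker then plugs into `rerank` the same way as the built-in ones:
```py
table.search("hello", query_type="hybrid").rerank(
    normalize="score",
    reranker=CustomReranker()
).to_pandas()
```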
- [x] Expand testing
- [x] Make sure usage makes sense
- [x] Run simple benchmarks for correctness (seeing weird results from
the Cohere reranker in the toy example)
- Support diverse rerankers by default:
- [x] Cross encoding
- [x] Cohere
- [x] Reciprocal Rank Fusion
---------
Co-authored-by: Chang She <759245+changhiskhan@users.noreply.github.com>
Co-authored-by: Prashanth Rao <35005448+prrao87@users.noreply.github.com>
Added testing and an example in the docstring; will push a
separate PR in the recipes repo with a RAG example.
---------
Co-authored-by: Ayush Chaurasia <ayush.chaurarsia@gmail.com>
Allow passing the API key as an env var:
```shell
export LANCEDB_API_KEY=sh_123...
```
With this set, the `apiKey` argument can be omitted from `connect`:
```js
const db = await vectordb.connect({
  uri: "db://test-proj-01-ae8343",
  region: "us-east-1",
})
```
```py
db = lancedb.connect(
    uri="db://test-proj-01-ae8343",
    region="us-east-1",
)
```
@eddyxu added instructions for linting here:
7af213801a/python/README.md (L45-L50)
However, we had a lot of failures and weren't checking this in CI. This
PR fixes all lint failures and adds a CI check to keep us in compliance
going forward.
This PR makes incremental changes to the documentation.
* Closes #697
* Closes #698
- [x] Add dark mode
- [x] Fix headers in navbar
- [x] Add `extra.css` to customize navbar styles
- [x] Customize fonts for prose/code blocks, navbar and admonitions
- [x] Inspect all admonition boxes (remove redundant dropdowns) and
improve clarity and readability
- [x] Ensure that all images in the docs have white background (not
transparent) to be viewable in dark mode
- [x] Improve code formatting in code blocks to make them consistent
with autoformatters (eslint/ruff)
- [x] Add bolder weight to h1 headers
- [x] Add diagram showing the difference between embedded (OSS) and
serverless (Cloud)
- [x] Fix [Creating an empty
table](https://lancedb.github.io/lancedb/guides/tables/#creating-empty-table)
section: right now, the subheaders are not clickable.
- [x] In critical data ingestion methods like `table.add` (among
others), the type signature often does not match the actual code
- [x] Proof-read each documentation section and rewrite as necessary to
provide more context, use cases, and explanations so it reads less like
reference documentation. This is especially important for CRUD and
search sections since those are so central to the user experience.
- [x] The section for [Adding
data](https://lancedb.github.io/lancedb/guides/tables/#adding-to-a-table)
only shows examples for pandas and iterables. We should include pydantic
models, arrow tables, etc.
- [x] Add conceptual tutorial for IVF-PQ index
- [x] Clearly separate vector search, FTS and filtering sections so that
these are easier to find
- [x] Add docs on refine factor to explain its importance for recall.
Closes #716
- [x] Add an FAQ page showing answers to commonly asked questions about
LanceDB. Closes #746
- [x] Add simple polars example to the integrations section. Closes #756
and closes #153
- [ ] Add basic docs for the Rust API (more detailed API docs can come
later). Closes #781
- [x] Add a section on the various storage options on local vs. cloud
(S3, EBS, EFS, local disk, etc.) and the tradeoffs involved. Closes #782
- [x] Revamp filtering docs: add pre-filtering examples, redo headers,
and update content for SQL filters. Closes #783 and closes #784
- [x] Add docs for data management: compaction, cleaning up old versions,
and incremental indexing. Closes #785
- [ ] Add a benchmark section that also discusses some best practices.
Closes #787
---------
Co-authored-by: Ayush Chaurasia <ayush.chaurarsia@gmail.com>
Co-authored-by: Will Jones <willjones127@gmail.com>
This mimics CREATE TABLE IF NOT EXISTS behavior.
We add a `db.create_table(..., exist_ok=True)` parameter.
By default it is set to False, so trying to create
a table with the same name will raise an exception.
If set to True, the existing table is opened instead.
If you pass in a schema, it will be checked against
the existing table to make sure you get what you want.
If you pass in data, it will NOT be added to the
existing table.
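A minimal sketch of the behavior (table name, path, and data are illustrative):
```py
import lancedb

db = lancedb.connect("./example_db")  # illustrative local path

# First call creates the table.
tbl = db.create_table("items", data=[{"id": 1}])

# Without exist_ok this raises because "items" exists; with exist_ok=True
# the existing table is opened. The data passed here is NOT added to it.
tbl = db.create_table("items", data=[{"id": 2}], exist_ok=True)
```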
This pull request adds a check for the presence of the
`OPENAI_API_KEY` environment variable and removes an unused parameter in
the `retry_with_exponential_backoff` function.
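A sketch of what the presence check looks like (the error message is illustrative):
```py
import os

if "OPENAI_API_KEY" not in os.environ:
    raise ValueError("OPENAI_API_KEY environment variable is not set")
```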
Named it Gemini-text for now. Not sure how complicated it will be to
support both text and multimodal embeddings under the same class
"gemini", but it's not something to worry about for now, I guess.
Addresses #797
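Assuming it is exposed through the embedding function registry under a matching key (a guess based on the name above, not confirmed here), usage would look something like:
```py
from lancedb.embeddings import get_registry

# "gemini-text" as the registry key is an assumption.
model = get_registry().get("gemini-text").create()
```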
Problem: tantivy does not expose an option to explicitly run a phrase query.
Proposed solution here:
1. Add a `.phrase_query()` option
2. Under the hood, LanceDB takes care of wrapping the input in quotes
and replacing nested double quotes with single quotes
I've also filed an upstream issue; if they support phrase queries
natively, we can get rid of our manual custom processing here.
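A minimal sketch of the quote handling in step 2 (the helper name is illustrative):
```py
def as_phrase_query(text: str) -> str:
    # Replace nested double quotes with single quotes, then wrap the
    # whole input in double quotes so tantivy parses it as a phrase.
    return '"' + text.replace('"', "'") + '"'

assert as_phrase_query('hello "world"') == "\"hello 'world'\""
```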
By default tantivy-py uses a 128MB heap size. We change the default to 1GB
and allow the user to customize this.
Locally, this makes `test_fts.py` run 10x faster.
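Sketch of customizing it at index creation time; the `writer_heap_size` keyword is an assumption, not confirmed by this changelog:
```py
# Hypothetical keyword name; the change only guarantees the heap size
# is user-customizable with a 1GB default.
table.create_fts_index("text", writer_heap_size=1024 * 1024 * 1024)
```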
If the input text is None, Tantivy raises an error
complaining it cannot add a NoneType. We handle this
upstream so Nones are not added to the document.
If all of the indexed fields are None, we skip
the document.
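A sketch of the guard, assuming tantivy-py's `Document.add_text` API (the helper is illustrative):
```py
import tantivy

def build_doc(row: dict, indexed_fields: list[str]):
    doc = tantivy.Document()
    added = 0
    for name in indexed_fields:
        value = row.get(name)
        if value is None:
            continue  # never hand a NoneType to tantivy
        doc.add_text(name, value)
        added += 1
    return doc if added else None  # all indexed fields None: skip the doc
```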
If you add timezone information in the `Field` annotation for a datetime,
that will now be passed to the pyarrow data type.
I'm not sure how pyarrow enforces timezones. Right now, it silently
coerces to the timezone given in the column regardless of whether the
input had a matching timezone or not. This is probably not the right
behavior; we could instead make the user do the validation in the
pydantic model rather than at the pyarrow conversion layer.
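On the pyarrow side, the timezone lives on the timestamp type itself, e.g.:
```py
import pyarrow as pa

# Timestamp column with timezone metadata attached to the type.
schema = pa.schema([pa.field("created_at", pa.timestamp("us", tz="America/New_York"))])
```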
Closes #721
FTS returns results as a pyarrow Table. Pyarrow tables have a
`filter` method, but it does not take SQL filter strings (only pyarrow
compute expressions). Instead, we do one of two things to support
`tbl.search("keywords").where("foo=5").limit(10).to_arrow()`:
Default path: if duckdb is available, use duckdb to execute the SQL
filter string on the pyarrow table.
Backup path: otherwise, write the pyarrow table to a lance dataset and
then call `to_table(filter=<filter>)`.
Neither is ideal.
The default path has two issues:
1. it requires installing an extra library (duckdb)
2. duckdb mangles some fields (e.g. fixed size list => list)
The backup path incurs a latency penalty (~20ms on SSD) to write the
result set to disk.
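A condensed sketch of the two paths (the helper and the dataset path are illustrative):
```py
import pyarrow as pa

def post_filter(tbl: pa.Table, where: str) -> pa.Table:
    try:
        # Default path: duckdb evaluates the SQL predicate directly on
        # the arrow table, referencing it by variable name.
        import duckdb
        return duckdb.query(f"SELECT * FROM tbl WHERE {where}").to_arrow_table()
    except ImportError:
        # Backup path: round-trip through a lance dataset and use its
        # SQL filter on read.
        import lance
        ds = lance.write_dataset(tbl, "/tmp/_post_filter.lance", mode="overwrite")
        return ds.to_table(filter=where)
```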
In the short term, once #676 is addressed, we can write the dataset to
"memory://" instead of disk, which makes the post-filter evaluate much
faster (ETA next week).
In the longer term, we'd like to be able to evaluate the filter string
on the pyarrow Table directly; one possibility is using Substrait to
generate pyarrow compute expressions from the SQL string. Or, if
there's enough progress on pyarrow, it could support Substrait
expressions directly (no ETA).
---------
Co-authored-by: Will Jones <willjones127@gmail.com>