lancedb

mirror of https://github.com/lancedb/lancedb.git synced 2026-07-03 19:10:41 +00:00

Author	SHA1	Message	Date
Prashanth Rao	119b928a52	docs: Updates and refactor (#683 ) This PR makes incremental changes to the documentation. * Closes #697 * Closes #698 ## Chores - [x] Add dark mode - [x] Fix headers in navbar - [x] Add `extra.css` to customize navbar styles - [x] Customize fonts for prose/code blocks, navbar and admonitions - [x] Inspect all admonition boxes (remove redundant dropdowns) and improve clarity and readability - [x] Ensure that all images in the docs have white background (not transparent) to be viewable in dark mode - [x] Improve code formatting in code blocks to make them consistent with autoformatters (eslint/ruff) - [x] Add bolder weight to h1 headers - [x] Add diagram showing the difference between embedded (OSS) and serverless (Cloud) - [x] Fix [Creating an empty table](https://lancedb.github.io/lancedb/guides/tables/#creating-empty-table) section: right now, the subheaders are not clickable. - [x] In critical data ingestion methods like `table.add` (among others), the type signature often does not match the actual code - [x] Proof-read each documentation section and rewrite as necessary to provide more context, use cases, and explanations so it reads less like reference documentation. This is especially important for CRUD and search sections since those are so central to the user experience. ## Restructure/new content - [x] The section for [Adding data](https://lancedb.github.io/lancedb/guides/tables/#adding-to-a-table) only shows examples for pandas and iterables. We should include pydantic models, arrow tables, etc. - [x] Add conceptual tutorial for IVF-PQ index - [x] Clearly separate vector search, FTS and filtering sections so that these are easier to find - [x] Add docs on refine factor to explain its importance for recall. Closes #716 - [x] Add an FAQ page showing answers to commonly asked questions about LanceDB. Closes #746 - [x] Add simple polars example to the integrations section. 
Closes #756 and closes #153 - [ ] Add basic docs for the Rust API (more detailed API docs can come later). Closes #781 - [x] Add a section on the various storage options on local vs. cloud (S3, EBS, EFS, local disk, etc.) and the tradeoffs involved. Closes #782 - [x] Revamp filtering docs: add pre-filtering examples and redo headers and update content for SQL filters. Closes #783 and closes #784. - [x] Add docs for data management: compaction, cleaning up old versions and incremental indexing. Closes #785 - [ ] Add a benchmark section that also discusses some best practices. Closes #787 --------- Co-authored-by: Ayush Chaurasia <ayush.chaurarsia@gmail.com> Co-authored-by: Will Jones <willjones127@gmail.com>	2024-01-19 00:18:37 +05:30
Lance Release	8bcdc81fd3	[python] Bump version: 0.4.4 → 0.5.0	2024-01-18 01:53:15 +00:00
Chang She	39e14c70c5	chore(python): turn off lazy frame ingestion (#821 )	2024-01-16 19:11:16 -08:00
Chang She	af8263af94	feat(python): allow the entire table to be converted to a polars dataframe (#814 )	2024-01-15 15:49:16 -08:00
Chang She	be4ab9eef3	feat(python): add exist_ok option to create table (#813 ) This mimics CREATE TABLE IF NOT EXISTS behavior. We add `db.create_table(..., exist_ok=True)` parameter. By default it is set to False, so trying to create a table with the same name will raise an exception. If set to True, then it only opens the table if it already exists. If you pass in a schema, it will be checked against the existing table to make sure you get what you want. If you pass in data, it will NOT be added to the existing table.	2024-01-15 11:09:18 -08:00
Ayush Chaurasia	184d2bc969	chore(python): get rid of Pydantic deprecation warning in embedding fcn (#816 ) ``` UserWarning: Valid config keys have changed in V2: * 'keep_untouched' has been renamed to 'ignored_types' warnings.warn(message, UserWarning) ```	2024-01-15 12:19:51 +05:30
Anton Shevtsov	ff6f005336	Add openai api key not found help (#815 ) This pull request adds a check for the presence of the environment variable `OPENAI_API_KEY` and removes an unused parameter in the `retry_with_exponential_backoff` function.	2024-01-15 02:44:09 +05:30
Chang She	49333e522c	feat(python): basic polars integration (#811 ) We should now be able to directly ingest polars dataframes and return results as polars dataframes ![image](https://github.com/lancedb/lancedb/assets/759245/828b1260-c791-45f1-a047-aa649575e798)	2024-01-13 16:38:16 -08:00
Ayush Chaurasia	4568df422d	feat(python): Add gemini text embedding function (#806 ) Named it Gemini-text for now. Not sure how complicated it will be to support both text and multimodal embeddings under the same class "gemini". But it's not something to worry about for now I guess.	2024-01-12 22:38:55 -08:00
Lance Release	0a16e29b93	[python] Bump version: 0.4.3 → 0.4.4	2024-01-11 21:29:00 +00:00
Will Jones	cf7d7a19f5	upgrade lance (#809 )	2024-01-11 13:28:10 -08:00
Lei Xu	fe2fb91a8b	chore: remove black as dependency (#808 ) We use `ruff` in CI and dev workflow now.	2024-01-11 10:58:49 -08:00
Sebastian Law	99adfe065a	use requests instead of aiohttp for underlying http client (#803 ) instead of starting and stopping the current thread's event loop on every http call, just make an http call.	2024-01-10 00:07:50 -05:00
Chang She	277406509e	chore(python): add docstring for limit behavior (#800 ) Closes #796	2024-01-09 20:20:13 -08:00
Chang She	63411b4d8b	feat(python): add phrase query option for fts (#798 ) addresses #797 Problem: tantivy does not expose option to explicitly Proposed solution here: 1. Add a `.phrase_query()` option 2. Under the hood, LanceDB takes care of wrapping the input in quotes and replacing nested double quotes with single quotes. I've also filed an upstream issue; if they support phrase queries natively then we can get rid of our manual custom processing here.	2024-01-09 19:41:31 -08:00
Chang She	d998f80b04	feat(python): add count_rows with filter option (#801 ) Closes #795	2024-01-09 19:33:03 -08:00
Chang She	99ba5331f0	feat(python): support new style optional syntax (#793 )	2024-01-09 07:03:29 -08:00
Chang She	121687231c	chore(python): document phrase queries in fts (#788 ) closes #769 Add unit test and documentation on using quotes to perform a phrase query	2024-01-08 21:49:31 -08:00
Lei Xu	c5a52565ac	chore: bump lance to 0.9.5 (#790 )	2024-01-07 19:27:47 -08:00
Chang She	b0a88a7286	feat(python): Set heap size to get faster fts indexing performance (#762 ) By default tantivy-py uses a 128MB heap size. We change the default to 1GB and we allow the user to customize this locally. This makes `test_fts.py` run 10x faster.	2024-01-07 15:15:13 -08:00
lucasiscovici	d41d849e0e	raise exception if fts index does not exist (#776 ) raise exception if fts index does not exist --------- Co-authored-by: Chang She <759245+changhiskhan@users.noreply.github.com>	2024-01-07 14:34:04 -08:00
Chang She	60b22d84bf	chore(python): handle NaN input in fts ingestion (#763 ) If the input text is None, Tantivy raises an error complaining it cannot add a NoneType. We handle this upstream so None's are not added to the document. If all of the indexed fields are None then we skip this document.	2024-01-04 11:45:12 -08:00
Lance Release	c3059dc689	[python] Bump version: 0.4.2 → 0.4.3	2023-12-30 00:52:54 +00:00
Lei Xu	a9caa5f2d4	chore: bump pylance to 0.9.2 (#754 )	2023-12-29 16:39:45 -08:00
Chang She	7773bda7ee	feat(python): first cut batch queries for remote api (#753 ) issue separate requests under the hood and concatenate results	2023-12-29 15:33:03 -08:00
Lance Release	392777952f	[python] Bump version: 0.4.1 → 0.4.2	2023-12-29 00:19:21 +00:00
Chang She	7e75e50d3a	chore(python): update embedding API to use openai 1.6.1 (#751 ) API has changed significantly, namely `openai.Embedding.create` no longer exists. https://github.com/openai/openai-python/discussions/742 Update the OpenAI embedding function and put a minimum on the openai sdk version.	2023-12-28 15:05:57 -08:00
Chang She	4b8af261a3	feat: add timezone handling for datetime in pydantic (#578 ) If you add timezone information in the Field annotation for a datetime then that will now be passed to the pyarrow data type. I'm not sure how pyarrow enforces timezones; right now, it silently coerces to the timezone given in the column regardless of whether the input had the matching timezone or not. This is probably not the right behavior. Though we could just make it so the user has to make the pydantic model do the validation instead of doing that at the pyarrow conversion layer.	2023-12-28 11:02:56 -08:00
Chang She	c8728d4ca1	feat(python): add post filtering for full text search (#739 ) Closes #721 fts will return results as a pyarrow table. Pyarrow tables have a `filter` method but it does not take sql filter strings (only pyarrow compute expressions). Instead, we do one of two things to support `tbl.search("keywords").where("foo=5").limit(10).to_arrow()`: Default path: If duckdb is available then use duckdb to execute the sql filter string on the pyarrow table. Backup path: Otherwise, write the pyarrow table to a lance dataset and then do `to_table(filter=<filter>)` Neither is ideal. Default path has two issues: 1. requires installing an extra library (duckdb) 2. duckdb mangles some fields (like fixed size list => list) Backup path incurs a latency penalty (~20ms on ssd) to write the result set to disk. In the short term, once #676 is addressed, we can write the dataset to "memory://" instead of disk; this makes the post filter evaluate much quicker (ETA next week). In the longer term, we'd like to be able to evaluate the filter string on the pyarrow Table directly, one possibility being that we use Substrait to generate pyarrow compute expressions from sql string. Or if there's enough progress on pyarrow, it could support Substrait expressions directly (no ETA) --------- Co-authored-by: Will Jones <willjones127@gmail.com>	2023-12-27 09:31:04 -08:00
Chang She	8f9ad978f5	feat(python): support list of list fields from pydantic schema (#747 ) For object detection, each row may correspond to an image and each image can have multiple bounding boxes of x-y coordinates. This means that a `bbox` field is potentially "list of list of float". This adds support in our pydantic-pyarrow conversion for nested lists.	2023-12-27 09:10:09 -08:00
Lance Release	60260018cf	[python] Bump version: 0.4.0 → 0.4.1	2023-12-26 16:51:16 +00:00
Will Jones	34966312cb	docs: enhance Update user guide (#735 ) Closes #705	2023-12-22 10:14:21 -08:00
Weston Pace	dc5126d8d1	feat: add the ability to create scalar indices (#679 ) This is a pretty direct binding to the underlying lance capability	2023-12-21 09:50:10 -08:00
Chang She	7bbb2872de	bug(python): fix path handling in windows (#724 ) Use pathlib for local paths so that pathlib can handle the correct separator on windows. Closes #703 --------- Co-authored-by: Will Jones <willjones127@gmail.com>	2023-12-20 15:41:36 -08:00
Will Jones	f9dd7a5d8a	fix: prevent duplicate data in FTS index (#728 ) This forces the user to replace the whole FTS directory when re-creating the index, preventing duplicate data from being created. Previously, the whole dataset was re-added to the existing index, duplicating existing rows in the index. This (in combination with lancedb/lance#1707) caused #726, since the duplicate data emitted duplicate indices for `take()` and an upstream issue caused those queries to fail. This solution isn't ideal, since it makes the FTS index temporarily unavailable while the index is built. In the future, we should have multiple FTS index directories, which would allow atomic commits of new indexes (as well as multiple indexes for different columns). Fixes #498. Fixes #726. --------- Co-authored-by: Chang She <759245+changhiskhan@users.noreply.github.com>	2023-12-20 13:07:07 -08:00
Will Jones	1d4943688d	upgrade lance to v0.9.1 (#727 ) This brings in some important bugfixes related to take and aarch64 Linux. See changes at: https://github.com/lancedb/lance/releases/tag/v0.9.1	2023-12-20 13:06:54 -08:00
Chang She	7856a94d2c	feat(python): support nested reference for fts (#723 ) https://github.com/lancedb/lance/issues/1739 Support nested field reference in full text search --------- Co-authored-by: Will Jones <willjones127@gmail.com>	2023-12-20 12:28:53 -08:00
Chang She	371d2f979e	feat(python): add option to flatten output in to_pandas (#722 ) Closes https://github.com/lancedb/lance/issues/1738 We add a `flatten` parameter to the signature of `to_pandas`. By default this is None and does nothing. If set to True or -1, then LanceDB will flatten structs before converting to a pandas dataframe. All nested structs are also flattened. If set to any positive integer, then LanceDB will flatten structs up to the specified level of nesting. --------- Co-authored-by: Weston Pace <weston.pace@gmail.com>	2023-12-20 12:23:07 -08:00
Lance Release	018314a5c1	[python] Bump version: 0.3.6 → 0.4.0	2023-12-18 17:27:26 +00:00
Lei Xu	409eb30ea5	chore: bump lance version to 0.9 (#715 )	2023-12-17 22:11:42 -05:00
Lance Release	a0608044a1	[python] Bump version: 0.3.5 → 0.3.6	2023-12-15 18:20:55 +00:00
Bert	57207eff4a	implement update for remote clients (#706 )	2023-12-15 09:06:40 -05:00
Rob Meng	2d78bff120	feat: pass vector column name to remote backend (#710 ) Pass vector column name to remote as well. `vector_column` is already part of `Query`; this just declares it as part of `remote.VectorQuery` as well.	2023-12-15 00:19:08 -05:00
Chang She	bd0034a157	feat: support nested pydantic schema (#707 )	2023-12-14 18:20:45 -08:00
Lance Release	600bfd7237	[python] Bump version: 0.3.4 → 0.3.5	2023-12-14 19:31:22 +00:00
Will Jones	d087e7891d	feat(python): add update query support for Python (#654 ) Closes #69 Will not pass until https://github.com/lancedb/lance/pull/1585 is released	2023-12-14 11:28:32 -08:00
Ayush Chaurasia	693091db29	chore(python): Reduce posthog event count (#661 ) - Register open_table as event - Because we're dropping the 'search' event currently, changed the name to 'search_table' and introduced throttling - Throttled events will be counted once per time batch so that the user is registered but event count doesn't go up by a lot	2023-12-08 11:00:51 -08:00
QianZhu	f6bbe199dc	Qian/minor fix doc (#695 )	2023-12-08 09:58:53 -08:00
QianZhu	aca785ff98	saas python sdk doc (#692 ) <img width="256" alt="Screenshot 2023-12-07 at 11 55 41 AM" src="https://github.com/lancedb/lancedb/assets/1305083/259bf234-9b3b-4c5d-af45-c7f3fada2cc7">	2023-12-07 14:47:56 -08:00
Bert	6eb662de9b	fix: python remote correct open_table error message (#659 )	2023-11-24 19:28:33 -05:00

230 Commits
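
The phrase-query handling described in commit 63411b4d8b (wrap the input in quotes and replace nested double quotes with single quotes) can be sketched in plain Python as below. This is only an illustration of the quoting step as the commit message describes it; the function name `to_phrase_query` is hypothetical and is not taken from the LanceDB codebase.

```python
def to_phrase_query(text: str) -> str:
    """Return text wrapped in double quotes so a query parser reads it as one phrase.

    Hypothetical helper: mirrors the two steps named in the commit message,
    not the actual LanceDB implementation.
    """
    # Step 1: nested double quotes would end the phrase early, so swap them
    # for single quotes.
    inner = text.replace('"', "'")
    # Step 2: wrap the whole input in double quotes.
    return f'"{inner}"'


if __name__ == "__main__":
    print(to_phrase_query('the "quick" brown fox'))
```

Replacing rather than escaping the inner quotes keeps the sketch independent of any particular query parser's escape syntax, which matches the workaround nature of the change noted in the commit.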