Commit Graph

265 Commits

Author SHA1 Message Date
Will Jones
73b2977bff chore: upgrade lance to 0.9.16 (#975) 2024-02-14 14:20:03 -08:00
Lance Release
5b60412d66 [python] Bump version: 0.5.4 → 0.5.5 2024-02-13 23:30:35 +00:00
Ayush Chaurasia
eb31d95fef feat(python): hybrid search updates, examples, & latency benchmarks (#964)
- Rename safe_import -> attempt_import_or_raise (closes
https://github.com/lancedb/lancedb/pull/923)
- Update docs
- Add Notebook example (@changhiskhan you can use it for the talk. Comes
with "open in colab" button)
- Latency benchmark & results comparison, sanity check on real-world
data
- Updates the default openai model to gpt-4
2024-02-13 17:58:39 +05:30
QianZhu
1b990983b3 Qian/make vector col optional (#950)
remote SDK tests were completed through lancedb_integtest
2024-02-12 16:35:44 -08:00
Lance Release
82936c77ef [python] Bump version: 0.5.3 → 0.5.4 2024-02-09 22:56:45 +00:00
Weston Pace
dddcddcaf9 chore: bump lance version to 0.9.15 (#949) 2024-02-09 14:55:44 -08:00
Weston Pace
a9727eb318 feat: add support for filter during merge insert when matched (#948)
Closes #940
2024-02-09 10:26:14 -08:00
QianZhu
48d55bf952 added error msg to SaaS APIs (#852)
1. improved error msg for SaaS create_table and create_index

---------

Co-authored-by: Chang She <759245+changhiskhan@users.noreply.github.com>
2024-02-09 10:07:47 -08:00
Weston Pace
d2e71c8b08 feat: add a filterable count_rows to all the lancedb APIs (#913)
A `count_rows` method that takes a filter was recently added to
`LanceTable`. This PR adds it everywhere else except `RemoteTable` (that
will come soon).
2024-02-08 09:40:29 -08:00
Ayush Chaurasia
d982ee934a feat(python): Reranker DX improvements (#904)
- Most users might not know how to use a `QueryBuilder` object. Instead we
should just pass the string query.
- Add new rerankers: Colbert, openai
2024-02-06 13:59:31 +05:30
Will Jones
57605a2d86 feat(python): add read_consistency_interval argument (#828)
This PR refactors how we handle read consistency: whether the `LanceTable`
class always picks up modifications to the table made by other instances
or processes. Users have three options they can set at the connection
level:

1. (Default) `read_consistency_interval=None` means it will not check at
all. Users can call `table.checkout_latest()` to manually check for
updates.
2. `read_consistency_interval=timedelta(0)` means **always** check for
updates, giving strong read consistency.
3. `read_consistency_interval=timedelta(seconds=20)` means check for
updates every 20 seconds. This is eventual consistency, a compromise
between the two options above.
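
The three options above can be sketched as a single refresh decision. This is a minimal illustration, not the `LanceTable` implementation: the helper name `needs_refresh` and its arguments are hypothetical, and only the meaning of `None`, `timedelta(0)`, and a positive interval is taken from the description.

```python
from datetime import datetime, timedelta
from typing import Optional


def needs_refresh(
    interval: Optional[timedelta],
    last_checked: datetime,
    now: datetime,
) -> bool:
    """Decide whether a table reference should look for a newer version.

    Hypothetical helper: it only mirrors the three documented settings.
    """
    if interval is None:
        # Option 1: never check automatically; the caller refreshes manually.
        return False
    # Options 2 and 3: check once at least `interval` has elapsed.
    # With timedelta(0) the condition always holds (strong consistency).
    return now - last_checked >= interval


t0 = datetime(2024, 2, 5, 8, 0, 0)
print(needs_refresh(None, t0, t0 + timedelta(hours=1)))
print(needs_refresh(timedelta(0), t0, t0))
print(needs_refresh(timedelta(seconds=20), t0, t0 + timedelta(seconds=5)))
print(needs_refresh(timedelta(seconds=20), t0, t0 + timedelta(seconds=25)))
```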

## Table reference state

There is now an explicit difference between a `LanceTable` that tracks
the current version and one that is fixed at a historical version. We
now enforce that users cannot write if they have checked out an old
version. They are instructed to call `checkout_latest()` before calling
the write methods.

Since `conn.open_table()` doesn't have a parameter for version, users
will only get fixed references if they call `table.checkout()`.

The difference between these two can be seen in the repr: Tables that are
fixed at a particular version will have a `version` displayed in the
repr. Otherwise, the version will not be shown.

```python
>>> table
LanceTable(connection=..., name="my_table")
>>> table.checkout(1)
>>> table
LanceTable(connection=..., name="my_table", version=1)
```
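
The rule that a reference fixed at an old version cannot write can be sketched with a toy class. `TableRef` is a hypothetical stand-in, not `LanceTable`; the in-memory snapshots and the choice of `ValueError` are assumptions made only for illustration.

```python
from typing import List, Optional


class TableRef:
    """Toy table reference that either tracks the latest version or is pinned."""

    def __init__(self, name: str) -> None:
        self.name = name
        # Snapshot i holds the rows of version i + 1; version 1 is empty.
        self._snapshots: List[List[dict]] = [[]]
        # None means "track the latest version".
        self._pinned: Optional[int] = None

    def checkout(self, version: int) -> None:
        if not 1 <= version <= len(self._snapshots):
            raise ValueError(f"unknown version {version}")
        self._pinned = version

    def checkout_latest(self) -> None:
        self._pinned = None

    def rows(self) -> List[dict]:
        index = -1 if self._pinned is None else self._pinned - 1
        return list(self._snapshots[index])

    def add(self, rows: List[dict]) -> None:
        if self._pinned is not None:
            # Writes are refused while a historical version is checked out.
            raise ValueError("call checkout_latest() before writing")
        self._snapshots.append(self._snapshots[-1] + list(rows))

    def __repr__(self) -> str:
        if self._pinned is None:
            return f'TableRef(name="{self.name}")'
        return f'TableRef(name="{self.name}", version={self._pinned})'


table = TableRef("my_table")
table.add([{"id": 1}])
print(table)
table.checkout(1)
print(table)
```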

I decided to not create different classes for these states, because I
think we already have enough complexity with the Cloud vs OSS table
references.

Based on #812
2024-02-05 08:12:19 -08:00
Ayush Chaurasia
738511c5f2 feat(python): add support new openai embedding functions (#912)
@PrashantDixit0

---------

Co-authored-by: Chang She <759245+changhiskhan@users.noreply.github.com>
2024-02-04 18:19:42 -08:00
Lance Release
a9088224c5 [python] Bump version: 0.5.2 → 0.5.3 2024-02-03 03:04:04 +00:00
Ayush Chaurasia
688c57a0d8 fix: revert safe_import_pandas usage (#921) 2024-02-02 18:57:13 -08:00
Lance Release
ce2242e06d [python] Bump version: 0.5.1 → 0.5.2 2024-02-02 21:33:02 +00:00
Weston Pace
778339388a chore: bump pylance version to latest in pyproject.toml (#918) 2024-02-02 13:32:12 -08:00
Weston Pace
7f8637a0b4 feat: add merge_insert to the node and rust APIs (#915) 2024-02-02 13:16:51 -08:00
QianZhu
09cd08222d make it explicit about the vector column data type (#916)
<img width="837" alt="Screenshot 2024-02-01 at 4 23 34 PM" src="https://github.com/lancedb/lancedb/assets/1305083/4f0f5c5a-2a24-4b00-aad1-ef80a593d964">
<img width="838" alt="Screenshot 2024-02-01 at 4 26 03 PM" src="https://github.com/lancedb/lancedb/assets/1305083/ca073bc8-b518-4be3-811d-8a7184416f07">

---------

Co-authored-by: Weston Pace <weston.pace@gmail.com>
2024-02-02 09:02:02 -08:00
Bert
a248d7feec fix: add request retry to python client (#917)
Adds capability to the remote python SDK to retry requests (fixes #911)

This can be configured through environment:
- `LANCE_CLIENT_MAX_RETRIES`= total number of retries. Set to 0 to
disable retries. default = 3
- `LANCE_CLIENT_CONNECT_RETRIES` = number of times to retry request in
case of TCP connect failure. default = 3
- `LANCE_CLIENT_READ_RETRIES` = number of times to retry request in case
of HTTP request failure. default = 3
- `LANCE_CLIENT_RETRY_STATUSES` = http statuses for which the request
will be retried. passed as comma separated list of ints. default `500,
502, 503`
- `LANCE_CLIENT_RETRY_BACKOFF_FACTOR` = controls time between retry
requests. see
[here](23f2287eb5/src/urllib3/util/retry.py (L141-L146)).
default = 0.25

Only read requests will be retried:
- list table names
- query
- describe table
- list table indices

This does not add retry capabilities for writes as it could possibly
cause issues in the case where the retried write isn't idempotent. For
example, in the case where the LB times-out the request but the server
completes the request anyway, we might not want to blindly retry an
insert request.
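
A minimal sketch of how the environment variables above could be parsed, using only the standard library. The `RetryConfig` dataclass and `load_retry_config` function are hypothetical names, not part of the SDK; only the variable names and defaults come from the description.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Tuple


@dataclass(frozen=True)
class RetryConfig:
    """Hypothetical container for the documented retry settings."""

    max_retries: int = 3
    connect_retries: int = 3
    read_retries: int = 3
    retry_statuses: Tuple[int, ...] = (500, 502, 503)
    backoff_factor: float = 0.25


def load_retry_config(env: Mapping[str, str] = os.environ) -> RetryConfig:
    """Read the LANCE_CLIENT_* variables, falling back to the stated defaults."""
    statuses = env.get("LANCE_CLIENT_RETRY_STATUSES", "500, 502, 503")
    return RetryConfig(
        max_retries=int(env.get("LANCE_CLIENT_MAX_RETRIES", "3")),
        connect_retries=int(env.get("LANCE_CLIENT_CONNECT_RETRIES", "3")),
        read_retries=int(env.get("LANCE_CLIENT_READ_RETRIES", "3")),
        # Comma-separated list of ints; surrounding whitespace is tolerated.
        retry_statuses=tuple(int(s) for s in statuses.split(",") if s.strip()),
        backoff_factor=float(env.get("LANCE_CLIENT_RETRY_BACKOFF_FACTOR", "0.25")),
    )


print(load_retry_config({"LANCE_CLIENT_MAX_RETRIES": "0"}))
```

Setting `LANCE_CLIENT_MAX_RETRIES` to `0`, as in the usage line, corresponds to disabling retries.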
2024-02-02 11:27:29 -05:00
Weston Pace
cc9473a94a docs: add cleanup_old_versions and compact_files to Table for documentation purposes (#900)
Closes #819
2024-02-01 15:06:00 -08:00
Weston Pace
d77e95a4f4 feat: upgrade to lance 0.9.11 and expose merge_insert (#906)
This adds the python bindings requested in #870. The javascript/rust
bindings will be added in a future PR.
2024-02-01 11:36:29 -08:00
Raghav Dixit
9df6905d86 chore(python): GTE embedding function model name update (#902)
Co-authored-by: Ayush Chaurasia <ayush.chaurarsia@gmail.com>
2024-01-30 23:56:29 +05:30
Ayush Chaurasia
3ffed89793 feat(python): Hybrid search & Reranker API (#824)
based on https://github.com/lancedb/lancedb/pull/713
- The Reranker api can be plugged into vector only or fts only search
but this PR doesn't do that (see example -
https://txt.cohere.com/rerank/)


### Default reranker -- `LinearCombinationReranker(weight=0.7, fill=1.0)`

```
table.search("hello", query_type="hybrid").rerank(normalize="score").to_pandas()
```
### Available rerankers
LinearCombinationReranker
```
from lancedb.rerankers import LinearCombinationReranker

# Same as default 
table.search("hello", query_type="hybrid").rerank(
                                      normalize="score", 
                                      reranker=LinearCombinationReranker()
                                     ).to_pandas()

# with custom params
reranker = LinearCombinationReranker(weight=0.3, fill=1.0)
table.search("hello", query_type="hybrid").rerank(
                                      normalize="score", 
                                      reranker=reranker
                                     ).to_pandas()
```
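
To give a feel for what `weight` and `fill` might control, here is a toy linear combination. It is not the library's implementation: it assumes both inputs are already normalized to penalties in [0, 1] where lower is better, that `weight` applies to the vector side, and that `fill` replaces a missing value. The function name `linear_combination` is hypothetical.

```python
from typing import Dict, List, Tuple


def linear_combination(
    vector_penalties: Dict[str, float],
    fts_penalties: Dict[str, float],
    weight: float = 0.7,
    fill: float = 1.0,
) -> List[Tuple[str, float]]:
    """Blend two penalty maps; documents missing from one side get `fill`."""
    combined = {}
    for doc_id in set(vector_penalties) | set(fts_penalties):
        v = vector_penalties.get(doc_id, fill)
        f = fts_penalties.get(doc_id, fill)
        combined[doc_id] = weight * v + (1.0 - weight) * f
    # Assumption: lower combined penalty means a better match.
    return sorted(combined.items(), key=lambda item: item[1])


ranked = linear_combination({"a": 0.1, "b": 0.5}, {"b": 0.2, "c": 0.0})
print([doc_id for doc_id, _ in ranked])
```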

Cohere Reranker
```
from lancedb.rerankers import CohereReranker

# default model.. English and multi-lingual supported. See docstring for available custom params
table.search("hello", query_type="hybrid").rerank(
                                      normalize="rank",  # score or rank
                                      reranker=CohereReranker()
                                     ).to_pandas()

```

CrossEncoderReranker

```
from lancedb.rerankers import CrossEncoderReranker

table.search("hello", query_type="hybrid").rerank(
                                      normalize="rank", 
                                      reranker=CrossEncoderReranker()
                                     ).to_pandas()

```

## Using custom Reranker
```
from lancedb.rerankers import Reranker

class CustomReranker(Reranker):
    def rerank_hybrid(self, vector_results, fts_results):
        # Merge both result sets, or substitute custom combination logic
        combined_res = self.merge_results(vector_results, fts_results)
        # Custom rerank logic here

        return combined_res
```

- [x] Expand testing
- [x] Make sure usage makes sense
- [x] Run simple benchmarks for correctness (Seeing weird result from
cohere reranker in the toy example)
- Support diverse rerankers by default:
- [x] Cross encoding
- [x] Cohere
- [x] Reciprocal Rank Fusion
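
For readers unfamiliar with Reciprocal Rank Fusion, the standard formula scores each document as the sum of `1 / (k + rank)` over every ranked list it appears in. The sketch below is a generic illustration of that formula, not the reranker shipped in this PR; the function name and the conventional `k=60` are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document ids into one ordering."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for the documents it ranks.
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document id.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_hits = ["a", "b", "c"]
fts_hits = ["b", "c", "d"]
print(reciprocal_rank_fusion([vector_hits, fts_hits]))
```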

---------

Co-authored-by: Chang She <759245+changhiskhan@users.noreply.github.com>
Co-authored-by: Prashanth Rao <35005448+prrao87@users.noreply.github.com>
2024-01-30 19:10:33 +05:30
Raghav Dixit
d1a7257810 feat(python): Embedding fn support for gte-mlx/gte-large (#873)
Added tests and an example in the docstring. A separate PR in the
recipe repo will add a RAG example.

---------

Co-authored-by: Ayush Chaurasia <ayush.chaurarsia@gmail.com>
2024-01-30 11:21:57 +05:30
Ayush Chaurasia
5c5e23bbb9 chore(python): Temporarily extend remote connection timeout (#888)
Context - https://etoai.slack.com/archives/C05NC5YSW5V/p1706371205883149
2024-01-29 17:34:33 +05:30
Ayush Chaurasia
d84e0d1db8 feat(python): Aws Bedrock embeddings integration (#822)
Supports amazon titan, cohere english & cohere multi-lingual base
models.
2024-01-28 02:04:15 +05:30
Lei Xu
ac94b2a420 chore: upgrade lance, pylance and datafusion (#879) 2024-01-27 12:31:38 -08:00
Bert
82cbcf6d07 Bump lance 0.9.9 (#851) 2024-01-24 08:41:28 -05:00
Lance Release
41f0e32a06 [python] Bump version: 0.5.0 → 0.5.1 2024-01-23 22:01:14 +00:00
QianZhu
b4d451ed21 extend timeout for requests.get and requests.post (#848) 2024-01-22 20:31:39 -08:00
Bert
66eaa2a00e allow passing api key as env var (#841)
Allow passing API key as env var:
```shell
export LANCEDB_API_KEY=sh_123...
```

With this set, the apiKey argument can be omitted from `connect`:
```js
const db = await vectordb.connect({
  uri: "db://test-proj-01-ae8343",
  region: "us-east-1",
});
```
```py
db = lancedb.connect(
    uri="db://test-proj-01-ae8343",
    region="us-east-1",
)
```
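
The fallback described above might look roughly like this. `resolve_api_key` is a hypothetical helper, and the precedence (explicit argument first, then `LANCEDB_API_KEY`) is an assumption rather than something the PR states.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(
    api_key: Optional[str] = None,
    env: Mapping[str, str] = os.environ,
) -> str:
    """Return the explicit key if given, else the LANCEDB_API_KEY variable."""
    key = api_key or env.get("LANCEDB_API_KEY")
    if not key:
        raise ValueError("no API key: pass one explicitly or set LANCEDB_API_KEY")
    return key


print(resolve_api_key(env={"LANCEDB_API_KEY": "sh_example"}))
```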
2024-01-22 16:18:28 -05:00
Lei Xu
83ed8d1e49 bug: add a test for fp16 (#837)
Add test to ingest fp16 to a database
2024-01-20 16:23:28 -08:00
Bert
c89d5e6e6d fix: remote python client closes idle connections (#831) 2024-01-19 17:28:36 -05:00
Will Jones
d012db24c2 ci: lint and enforce linting (#829)
@eddyxu added instructions for linting here:


7af213801a/python/README.md (L45-L50)

However, we had a lot of failures and weren't checking this in CI. This
PR fixes all lints and adds a check to CI to keep us in compliance with
the lints.
2024-01-19 13:09:14 -08:00
Bert
7af213801a bump lance to 0.9.7 (#826) 2024-01-18 20:44:22 -08:00
Prashanth Rao
119b928a52 docs: Updates and refactor (#683)
This PR makes incremental changes to the documentation.

* Closes #697 
* Closes #698

## Chores
- [x] Add dark mode
- [x] Fix headers in navbar
- [x] Add `extra.css` to customize navbar styles
- [x] Customize fonts for prose/code blocks, navbar and admonitions
- [x] Inspect all admonition boxes (remove redundant dropdowns) and
improve clarity and readability
- [x] Ensure that all images in the docs have white background (not
transparent) to be viewable in dark mode
- [x] Improve code formatting in code blocks to make them consistent
with autoformatters (eslint/ruff)
- [x] Add bolder weight to h1 headers
- [x] Add diagram showing the difference between embedded (OSS) and
serverless (Cloud)
- [x] Fix [Creating an empty
table](https://lancedb.github.io/lancedb/guides/tables/#creating-empty-table)
section: right now, the subheaders are not clickable.
- [x] In critical data ingestion methods like `table.add` (among
others), the type signature often does not match the actual code
- [x] Proof-read each documentation section and rewrite as necessary to
provide more context, use cases, and explanations so it reads less like
reference documentation. This is especially important for CRUD and
search sections since those are so central to the user experience.

## Restructure/new content 
- [x] The section for [Adding
data](https://lancedb.github.io/lancedb/guides/tables/#adding-to-a-table)
only shows examples for pandas and iterables. We should include pydantic
models, arrow tables, etc.
- [x] Add conceptual tutorial for IVF-PQ index
- [x] Clearly separate vector search, FTS and filtering sections so that
these are easier to find
- [x] Add docs on refine factor to explain its importance for recall.
Closes #716
- [x] Add an FAQ page showing answers to commonly asked questions about
LanceDB. Closes #746
- [x] Add simple polars example to the integrations section. Closes #756
and closes #153
- [ ] Add basic docs for the Rust API (more detailed API docs can come
later). Closes #781
- [x] Add a section on the various storage options on local vs. cloud
(S3, EBS, EFS, local disk, etc.) and the tradeoffs involved. Closes #782
- [x] Revamp filtering docs: add pre-filtering examples and redo headers
and update content for SQL filters. Closes #783 and closes #784.
- [x] Add docs for data management: compaction, cleaning up old versions
and incremental indexing. Closes #785
- [ ] Add a benchmark section that also discusses some best practices.
Closes #787

---------

Co-authored-by: Ayush Chaurasia <ayush.chaurarsia@gmail.com>
Co-authored-by: Will Jones <willjones127@gmail.com>
2024-01-19 00:18:37 +05:30
Lance Release
8bcdc81fd3 [python] Bump version: 0.4.4 → 0.5.0 2024-01-18 01:53:15 +00:00
Chang She
39e14c70c5 chore(python): turn off lazy frame ingestion (#821) 2024-01-16 19:11:16 -08:00
Chang She
af8263af94 feat(python): allow the entire table to be converted a polars dataframe (#814) 2024-01-15 15:49:16 -08:00
Chang She
be4ab9eef3 feat(python): add exist_ok option to create table (#813)
This mimics CREATE TABLE IF NOT EXISTS behavior.
We add `db.create_table(..., exist_ok=True)` parameter.
By default it is set to False, so trying to create
a table with the same name will raise an exception.
If set to True, then it only opens the table if it
already exists. If you pass in a schema, it will
be checked against the existing table to make sure
you get what you want. If you pass in data, it will
NOT be added to the existing table.
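
The `exist_ok` behavior described above can be sketched with an in-memory catalog. `Catalog` is a hypothetical toy, not the LanceDB connection class, and the exception types are assumptions.

```python
from typing import Dict, List, Optional


class Catalog:
    """Toy table catalog illustrating CREATE TABLE IF NOT EXISTS semantics."""

    def __init__(self) -> None:
        self._tables: Dict[str, dict] = {}

    def create_table(
        self,
        name: str,
        schema: Optional[dict] = None,
        data: Optional[List[dict]] = None,
        exist_ok: bool = False,
    ) -> dict:
        if name in self._tables:
            if not exist_ok:
                raise ValueError(f"table {name!r} already exists")
            existing = self._tables[name]
            # A supplied schema is checked against the existing table.
            if schema is not None and schema != existing["schema"]:
                raise ValueError("schema does not match the existing table")
            # Supplied data is NOT added to the existing table.
            return existing
        table = {"schema": schema, "rows": list(data or [])}
        self._tables[name] = table
        return table


db = Catalog()
db.create_table("items", schema={"id": "int"}, data=[{"id": 1}])
reopened = db.create_table("items", data=[{"id": 2}], exist_ok=True)
print(reopened["rows"])
```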
2024-01-15 11:09:18 -08:00
Ayush Chaurasia
184d2bc969 chore(python): get rid of Pydantic deprication warning in embedding fcn (#816)
```
UserWarning: Valid config keys have changed in V2:
* 'keep_untouched' has been renamed to 'ignored_types' warnings.warn(message, UserWarning)
```
2024-01-15 12:19:51 +05:30
Anton Shevtsov
ff6f005336 Add openai api key not found help (#815)
This pull request adds a check for the presence of the `OPENAI_API_KEY`
environment variable and removes an unused parameter from the
`retry_with_exponential_backoff` function.
2024-01-15 02:44:09 +05:30
Chang She
49333e522c feat(python): basic polars integration (#811)
We should now be able to directly ingest polars dataframes and return
results as polars dataframes


![image](https://github.com/lancedb/lancedb/assets/759245/828b1260-c791-45f1-a047-aa649575e798)
2024-01-13 16:38:16 -08:00
Ayush Chaurasia
4568df422d feat(python): Add gemini text embedding function (#806)
Named it Gemini-text for now. Not sure how complicated it will be to
support both text and multimodal embeddings under the same class
"gemini", but it's not something to worry about for now.
2024-01-12 22:38:55 -08:00
Lance Release
0a16e29b93 [python] Bump version: 0.4.3 → 0.4.4 2024-01-11 21:29:00 +00:00
Will Jones
cf7d7a19f5 upgrade lance (#809) 2024-01-11 13:28:10 -08:00
Lei Xu
fe2fb91a8b chore: remove black as dependency (#808)
We use `ruff` in CI and dev workflow now.
2024-01-11 10:58:49 -08:00
Sebastian Law
99adfe065a use requests instead of aiohttp for underlying http client (#803)
instead of starting and stopping the current thread's event loop on
every http call, just make an http call.
2024-01-10 00:07:50 -05:00
Chang She
277406509e chore(python): add docstring for limit behavior (#800)
Closes #796
2024-01-09 20:20:13 -08:00
Chang She
63411b4d8b feat(python): add phrase query option for fts (#798)
addresses #797 

Problem: tantivy does not expose option to explicitly

Proposed solution here: 

1. Add a `.phrase_query()` option
2. Under the hood, LanceDB takes care of wrapping the input in quotes
and replacing nested double quotes with single quotes
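
The quoting step in item 2 can be sketched in a few lines. `to_phrase_query` is a hypothetical helper name; it only shows the described transformation, not the code in this PR.

```python
def to_phrase_query(text: str) -> str:
    """Wrap the input in double quotes, turning nested double quotes into single quotes."""
    return '"' + text.replace('"', "'") + '"'


print(to_phrase_query('say "hello" world'))
```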

I've also filed an upstream issue; if they support phrase queries
natively, we can get rid of our manual custom processing here.
2024-01-09 19:41:31 -08:00