This PR refactors how we handle read consistency, that is, whether the
`LanceTable` class always picks up modifications to the table made by
other instances or processes. Users have three options they can set at
the connection level:
1. (Default) `read_consistency_interval=None` means it will not check at
all. Users can call `table.checkout_latest()` to manually check for
updates.
2. `read_consistency_interval=timedelta(0)` means **always** check for
updates, giving strong read consistency.
3. `read_consistency_interval=timedelta(seconds=20)` means check for
updates every 20 seconds. This is eventual consistency, a compromise
between the two options above.
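To make the three settings concrete, here is a minimal sketch of the refresh decision they imply. `needs_refresh` is a hypothetical helper written for illustration; it is not part of LanceDB and may not match how the check is actually implemented.

```python
from datetime import datetime, timedelta
from typing import Optional


def needs_refresh(
    interval: Optional[timedelta],
    last_checked: datetime,
    now: datetime,
) -> bool:
    """Hypothetical helper: should the table look for a newer version?"""
    if interval is None:
        # Option 1: never check automatically; the caller is expected to
        # use table.checkout_latest() when it wants fresh data.
        return False
    # Option 2 (timedelta(0)) is always satisfied, so every read checks.
    # Option 3 is satisfied once the interval has elapsed.
    return now - last_checked >= interval
```

With `timedelta(seconds=20)`, a read 5 seconds after the last check reuses the cached version, while a read 20 seconds or more after it triggers a check.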
## Table reference state
There is now an explicit difference between a `LanceTable` that tracks
the current version and one that is fixed at a historical version. We
now enforce that users cannot write if they have checked out an old
version. They are instructed to call `checkout_latest()` before calling
the write methods.
Since `conn.open_table()` doesn't have a parameter for version, users
will only get fixed references if they call `table.checkout()`.
The difference between these two can be seen in the repr: tables that are
fixed at a particular version will have a `version` displayed in the
repr. Otherwise, the version will not be shown.
```python
>>> table
LanceTable(connection=..., name="my_table")
>>> table.checkout(1)
>>> table
LanceTable(connection=..., name="my_table", version=1)
```
I decided to not create different classes for these states, because I
think we already have enough complexity with the Cloud vs OSS table
references.
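As a rough illustration of keeping both states in one class, the sketch below uses a hypothetical `TableRef` stand-in (not the real `LanceTable`). The `_pinned_version` attribute, the `add` return value, and the error message are assumptions made for the example.

```python
from typing import Optional


class TableRef:
    """Hypothetical stand-in for a table reference; not the LanceTable class."""

    def __init__(self, name: str) -> None:
        self.name = name
        # None means "track the latest version"; an int means "fixed".
        self._pinned_version: Optional[int] = None

    def checkout(self, version: int) -> None:
        self._pinned_version = version

    def checkout_latest(self) -> None:
        self._pinned_version = None

    def add(self, rows: list) -> int:
        # Writes are rejected while a historical version is checked out.
        if self._pinned_version is not None:
            raise ValueError(
                f"Table {self.name} is checked out at version "
                f"{self._pinned_version}; call checkout_latest() before writing."
            )
        return len(rows)  # placeholder for the real write

    def __repr__(self) -> str:
        parts = [f'name="{self.name}"']
        if self._pinned_version is not None:
            # Only fixed references show a version.
            parts.append(f"version={self._pinned_version}")
        return f"TableRef({', '.join(parts)})"
```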
Based on #812
based on https://github.com/lancedb/lancedb/pull/713
- The Reranker api can be plugged into vector only or fts only search
but this PR doesn't do that (see example -
https://txt.cohere.com/rerank/)
### Default reranker -- `LinearCombinationReranker(weight=0.7, fill=1.0)`
```
table.search("hello", query_type="hybrid").rerank(normalize="score").to_pandas()
```
### Available rerankers
LinearCombinationReranker
```
from lancedb.rerankers import LinearCombinationReranker
# Same as default
table.search("hello", query_type="hybrid").rerank(
normalize="score",
reranker=LinearCombinationReranker()
).to_pandas()
# with custom params
reranker = LinearCombinationReranker(weight=0.3, fill=1.0)
table.search("hello", query_type="hybrid").rerank(
normalize="score",
reranker=reranker
).to_pandas()
```
Cohere Reranker
```
from lancedb.rerankers import CohereReranker
# default model.. English and multi-lingual supported. See docstring for available custom params
table.search("hello", query_type="hybrid").rerank(
normalize="rank", # score or rank
reranker=CohereReranker()
).to_pandas()
```
CrossEncoderReranker
```
from lancedb.rerankers import CrossEncoderReranker
table.search("hello", query_type="hybrid").rerank(
normalize="rank",
reranker=CrossEncoderReranker()
).to_pandas()
```
## Using custom Reranker
```
from lancedb.rerankers import Reranker

class CustomReranker(Reranker):
    def rerank_hybrid(self, vector_results, fts_results):
        # Merge the two result sets, or use custom combination logic
        combined_res = self.merge_results(vector_results, fts_results)
        # Custom rerank logic here
        return combined_res
```
- [x] Expand testing
- [x] Make sure usage makes sense
- [x] Run simple benchmarks for correctness (Seeing weird result from
cohere reranker in the toy example)
- Support diverse rerankers by default:
- [x] Cross encoding
- [x] Cohere
- [x] Reciprocal Rank Fusion
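
For reference, Reciprocal Rank Fusion can be sketched in a few lines of plain Python. This is a generic illustration of the technique, not the code in this PR; the function name, the use of string ids, and `k=60` (a commonly used constant) are assumptions.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked id lists: score(d) = sum of 1 / (k + rank of d)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based; ids missing from a ranking contribute nothing.
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)
```

For example, fusing a vector ranking `["a", "b", "c"]` with an FTS ranking `["b", "d", "a"]` puts `b` first, because it ranks near the top of both lists.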
---------
Co-authored-by: Chang She <759245+changhiskhan@users.noreply.github.com>
Co-authored-by: Prashanth Rao <35005448+prrao87@users.noreply.github.com>
Closes https://github.com/lancedb/lance/issues/1738
We add a `flatten` parameter to the signature of `to_pandas`. By default
this is None and does nothing.
If set to True or -1, then LanceDB will flatten structs before
converting to a pandas dataframe. All nested structs are also flattened.
If set to any positive integer, then LanceDB will flatten structs up to
the specified level of nesting.
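
The semantics of `flatten` can be illustrated on plain dicts. This is only a sketch: `flatten_record` is a hypothetical helper, the dotted `parent.child` column names are this example's convention, and rejecting other values with `ValueError` is an assumption.

```python
from typing import Any, Dict, Optional, Union


def flatten_record(
    record: Dict[str, Any], flatten: Optional[Union[int, bool]] = None
) -> Dict[str, Any]:
    """Hypothetical helper mirroring the flatten=None / True / -1 / N options."""
    if flatten is None:
        return record  # default: leave structs untouched
    if flatten is True or flatten == -1:
        levels = None  # flatten every level of nesting
    elif isinstance(flatten, int) and not isinstance(flatten, bool) and flatten > 0:
        levels = flatten  # flatten at most this many levels
    else:
        raise ValueError("flatten must be None, True, -1, or a positive integer")

    out = dict(record)
    while levels is None or levels > 0:
        if not any(isinstance(value, dict) for value in out.values()):
            break  # nothing left to flatten
        flattened: Dict[str, Any] = {}
        for key, value in out.items():
            if isinstance(value, dict):
                # Promote each struct field to a top-level "parent.child" key.
                for sub_key, sub_value in value.items():
                    flattened[f"{key}.{sub_key}"] = sub_value
            else:
                flattened[key] = value
        out = flattened
        if levels is not None:
            levels -= 1
    return out
```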
---------
Co-authored-by: Weston Pace <weston.pace@gmail.com>
Add `to_list` to return query results as a list of Python dicts (so we're
not too pandas-centric). Closes #555
Add `to_pandas` API and add deprecation warning on `to_df`. Closes #545
Co-authored-by: Chang She <chang@lancedb.com>
1. Support persistent embedding function so users can just search using
query string
2. Add fixed size list conversion for multiple vector columns
3. Add support for empty query (just apply select/where/limit).
4. Refactor and simplify some of the data prep code
---------
Co-authored-by: Chang She <chang@lancedb.com>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
Combine delete and append to make a temporary update feature that is
only enabled for the local python lancedb.
This is temporary because it first has to load the data that matches the
where clause into memory, which is technically unbounded.
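
The delete-then-append approach can be sketched on an in-memory list of rows. `update_rows` and its `where`/`values` parameters are hypothetical names for illustration, not the signature added in this PR.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


def update_rows(
    rows: List[Row], where: Callable[[Row], bool], values: Dict[str, Any]
) -> List[Row]:
    """Hypothetical sketch: emulate an update as delete + append."""
    # Step 1: load every matching row into memory. This is the unbounded
    # part, since the where clause may match the whole table.
    matched = [dict(row) for row in rows if where(row)]
    # Step 2: delete the matching rows.
    remaining = [row for row in rows if not where(row)]
    # Step 3: apply the new values and append the rows back.
    for row in matched:
        row.update(values)
    return remaining + matched
```

Note that updated rows end up at the end of the table, which is a side effect of appending them after the delete.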
---------
Co-authored-by: Chang She <chang@lancedb.com>
Previously if you needed to add a column to a table you'd have to
rewrite the whole table. Instead,
we use the merge functionality from Lance format
to incrementally add columns from another table
or dataframe.
---------
Co-authored-by: Chang She <chang@lancedb.com>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
Previously the temporary restore feature required copying data. The new
feature in pylance does not.
---------
Co-authored-by: Chang She <chang@lancedb.com>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
This adds LanceTable.restore as a temporary feature. It reads data from
a previous version and creates
a new snapshot version using that data. This makes the version writeable
unlike checkout. This should be replaced once the feature is implemented
in pylance.
Co-authored-by: Chang She <chang@lancedb.com>
It's inconvenient to always require data at table creation time.
Here we enable you to create an empty table and add data and set schema
later.
---------
Co-authored-by: Chang She <chang@lancedb.com>
Sometimes LangChain would insert a single `[np.nan]` as a placeholder if
the embedding function failed. This causes a problem for Lance format
because then the array can't be stored as a `FixedSizeListArray`.
Instead:
1. By default we remove rows with embedding lengths less than the
maximum length in the batch
2. If the `strict=True` kwarg is set, then a `ValueError` is raised
if the embeddings aren't all the same length
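
A minimal sketch of the two behaviours, using plain Python lists instead of Arrow arrays. `sanitize_embeddings` is a hypothetical name and the error message is an assumption.

```python
from typing import List, Sequence


def sanitize_embeddings(
    vectors: Sequence[Sequence[float]], strict: bool = False
) -> List[List[float]]:
    """Hypothetical helper: keep only vectors of the maximum length."""
    if not vectors:
        return []
    max_len = max(len(vector) for vector in vectors)
    if strict and any(len(vector) != max_len for vector in vectors):
        # strict=True: refuse to drop rows silently.
        raise ValueError("all embeddings must have the same length")
    # Default: drop short rows, such as a lone [nan] placeholder.
    return [list(vector) for vector in vectors if len(vector) == max_len]
```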
---------
Co-authored-by: Chang She <chang@lancedb.com>
* `to_df()` is now async, added `to_df_blocking` for convenience
* add remote lancedb client to public lancedb
* make lancedb connection class understand url scheme
`lancedb+<connection_type>://<host>:<port>`.
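
For illustration, a URL of this shape can be split with the standard library as below. `parse_lancedb_url` is a hypothetical helper and the `http` connection type in the usage note is only an example value; the actual parsing in the connection class may differ.

```python
from typing import Tuple
from urllib.parse import urlparse


def parse_lancedb_url(url: str) -> Tuple[str, str, int]:
    """Hypothetical helper: split lancedb+<connection_type>://<host>:<port>."""
    parsed = urlparse(url)
    # The scheme looks like "lancedb+<connection_type>".
    prefix, sep, connection_type = parsed.scheme.partition("+")
    if prefix != "lancedb" or not sep or not connection_type:
        raise ValueError(f"not a lancedb+<connection_type> URL: {url}")
    if parsed.hostname is None or parsed.port is None:
        raise ValueError(f"URL must include host and port: {url}")
    return connection_type, parsed.hostname, parsed.port
```

Under these assumptions, `parse_lancedb_url("lancedb+http://localhost:10024")` yields the connection type `http`, the host `localhost`, and the port `10024`.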
PyPI does not allow uploading packages that have a direct reference
dependency, so for now we'll just ask the user to install tantivy separately.
---------
Co-authored-by: Chang She <chang@lancedb.com>
- Fixed `add` unit test to create the correct expected result
- Added a unit test for LanceTable.add
- Need to discuss if len(LanceTable) is handled correctly