This PR adds an overview of embeddings to the docs:
- two ways to vectorize your data using lancedb - explicit & implicit
- explicit - manually vectorize your data with the `with_embeddings`
function (see the sketch after this list)
- implicit - automatically vectorize your data as it is ingested, by
storing your embedding function details as table metadata
- Multi-modal example w/ disappearing embedding function
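A minimal sketch of the explicit path, assuming `with_embeddings` appends
the computed vectors as a `vector` column (the `embed` function here is a
hypothetical stand-in for a real model):
```
import lancedb
import pandas as pd
from lancedb.embeddings import with_embeddings

def embed(batch):
    # Hypothetical embedding: returns one 2-d vector per input string.
    return [[float(len(s)), 1.0] for s in batch]

df = pd.DataFrame({"text": ["hello world", "goodbye world"]})
data = with_embeddings(embed, df, column="text")

db = lancedb.connect("/tmp/demo-db")
tbl = db.create_table("demo", data)
```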
Add `to_list` to return query results as a list of Python dicts (so we're
not too pandas-centric). Closes #555
Add a `to_pandas` API and a deprecation warning on `to_df`. Closes #545
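For example (hypothetical query vector; `tbl` is any LanceDB table):
```
results = tbl.search([0.1, 0.2]).limit(2).to_list()  # list of Python dicts
df = tbl.search([0.1, 0.2]).limit(2).to_pandas()     # pandas DataFrame
old = tbl.search([0.1, 0.2]).limit(2).to_df()        # still works, now warns
```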
Co-authored-by: Chang She <chang@lancedb.com>
We have experimental support for prefiltering (without ANN) in pylance.
This means that we can now apply a filter BEFORE vector search is
performed. This can be done via the `.where(filter_string,
prefilter=True)` kwarg of the query.
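A minimal sketch (hypothetical data and filter; see the limitations below):
```
results = (
    tbl.search([0.5, 0.2])
    .where("price > 10.0", prefilter=True)  # filter first, then vector search
    .limit(5)
    .to_list()
)
```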
Limitations:
- When connecting to LanceDB Cloud, `prefilter=True` raises
`NotImplementedError`
- When an ANN index is present, `prefilter=True` raises
`NotImplementedError`
- This option is not available for full-text search queries
- This option is not available for empty search queries (just
filter/project)
Additional changes in this PR:
- Bump pylance version to v0.8.0 which supports the experimental
prefiltering.
---------
Co-authored-by: Chang She <chang@lancedb.com>
The `attr` project is unrelated to `attrs`, which also provides the `attr`
namespace (see also <https://hynek.me/articles/import-attrs/>).
It used to _usually_ work because `attrs` is a dependency of aiohttp and
somehow took precedence over the `attr` project's own `attr` module.
Yes, sorry, it's a mess.
1. Support persistent embedding functions so users can search with just a
query string (see the sketch after this list)
2. Add fixed size list conversion for multiple vector columns
3. Add support for empty queries (just apply select/where/limit).
4. Refactor and simplify some of the data prep code
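A minimal sketch of item 1, assuming `tbl` was created with an embedding
function persisted in its metadata (hypothetical query text):
```
# The stored embedding function vectorizes the query string automatically,
# so no explicit embedding call is needed at search time.
results = tbl.search("pasta recipes").limit(3).to_list()
```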
---------
Co-authored-by: Chang She <chang@lancedb.com>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
Combine delete and append to make a temporary update feature that is
only enabled for the local Python lancedb.
The reason this is temporary is that it first has to load the
data matching the where clause into memory, which is technically
unbounded.
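A minimal sketch, assuming the feature is exposed as `LanceTable.update`
(hypothetical column names and values):
```
# Matching rows are loaded into memory, deleted, and re-appended with the
# new values, so keep the where clause selective.
tbl.update(where="item = 'foo'", values={"price": 12.0})
```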
---------
Co-authored-by: Chang She <chang@lancedb.com>
Previously, if you needed to add a column to a table, you'd have to
rewrite the whole table. Instead,
we use the merge functionality from the Lance format
to incrementally add columns from another table
or dataframe.
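A minimal sketch, assuming the feature surfaces as `LanceTable.merge`
(hypothetical column and dataframe names):
```
import pandas as pd

# Join on `item` to add an `in_stock` column without rewriting `tbl`.
extra = pd.DataFrame({"item": ["foo", "bar"], "in_stock": [True, False]})
tbl.merge(extra, left_on="item")
```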
---------
Co-authored-by: Chang She <chang@lancedb.com>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
Previously the temporary restore feature required copying data. The new
feature in pylance does not.
---------
Co-authored-by: Chang She <chang@lancedb.com>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
It improves the UX: iterators can now yield any data type supported by the
table (plus RecordBatch), with no separate requirements.
Also expands the test cases for pydantic & arrow schemas.
If this looks good I'll update the docs.
Example usage:
```
import lancedb
import pandas as pd
import pyarrow as pa

from lancedb.pydantic import LanceModel, vector


class Content(LanceModel):
    vector: vector(2)
    item: str
    price: float


def make_batches():
    for _ in range(5):
        yield from [
            # pandas
            pd.DataFrame({
                "vector": [[3.1, 4.1], [1, 1]],
                "item": ["foo", "bar"],
                "price": [10.0, 20.0],
            }),
            # pylist
            [
                {"vector": [3.1, 4.1], "item": "foo", "price": 10.0},
                {"vector": [5.9, 26.5], "item": "bar", "price": 20.0},
            ],
            # recordbatch
            pa.RecordBatch.from_arrays(
                [
                    pa.array([[3.1, 4.1], [5.9, 26.5]], pa.list_(pa.float32(), 2)),
                    pa.array(["foo", "bar"]),
                    pa.array([10.0, 20.0]),
                ],
                ["vector", "item", "price"],
            ),
            # pydantic list
            [
                Content(vector=[3.1, 4.1], item="foo", price=10.0),
                Content(vector=[5.9, 26.5], item="bar", price=20.0),
            ],
        ]


db = lancedb.connect("db")
tbl = db.create_table("tabley", make_batches(), schema=Content, mode="overwrite")
tbl.add(make_batches())
```
The same should work with an Arrow schema.
---------
Co-authored-by: Weston Pace <weston.pace@gmail.com>
This adds LanceTable.restore as a temporary feature. It reads data from
a previous version and creates
a new snapshot version using that data. This makes the version writeable,
unlike checkout. This should be replaced once the feature is implemented
in pylance.
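A minimal sketch, assuming checkout and restore on the synchronous table API
(hypothetical version number):
```
tbl.checkout(1)  # read-only view of version 1
tbl.restore()    # writes version 1's data back as a new, writeable version
```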
Co-authored-by: Chang She <chang@lancedb.com>
BREAKING CHANGE: The `score` column has been renamed to `_distance` to
more accurately describe the semantics (smaller means closer / better).
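For example (hypothetical values):
```
hits = tbl.search([3.0, 4.0]).limit(1).to_list()
# [{"vector": [3.1, 4.1], "item": "foo", "_distance": 0.04}]  # was "score"
```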
---------
Co-authored-by: Lei Xu <lei@lancedb.com>
Fixes #416.
Added a `drop_database()` method, which deletes all the tables from the
database with a single command.
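For example:
```
import lancedb

db = lancedb.connect("/tmp/demo-db")
db.create_table("t1", [{"vector": [1.0, 2.0]}])
db.create_table("t2", [{"vector": [3.0, 4.0]}])
db.drop_database()  # removes every table in one call
```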
---------
Signed-off-by: Ashis Kumar Naik <ashishami2002@gmail.com>
Saves users from having to explicitly call
`LanceModel.to_arrow_schema()` when creating an empty table.
See new docs for full details.
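A minimal sketch (hypothetical model; `db` is an open connection):
```
from lancedb.pydantic import LanceModel, vector

class Item(LanceModel):
    vector: vector(2)
    name: str

# Previously: db.create_table("items", schema=Item.to_arrow_schema())
tbl = db.create_table("items", schema=Item)
```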
---------
Co-authored-by: Chang She <chang@lancedb.com>
Makes the following work, so that all the formats accepted by `create_table()`
are also accepted by `add()`:
```
import lancedb
import pyarrow as pa

db = lancedb.connect("/tmp")


def make_batches():
    for _ in range(5):
        yield pa.RecordBatch.from_arrays(
            [
                pa.array([[3.1, 4.1], [5.9, 26.5]]),
                pa.array(["foo", "bar"]),
                pa.array([10.0, 20.0]),
            ],
            ["vector", "item", "price"],
        )


schema = pa.schema([
    pa.field("vector", pa.list_(pa.float32())),
    pa.field("item", pa.utf8()),
    pa.field("price", pa.float32()),
])
tbl = db.create_table("table4", make_batches(), schema=schema)
tbl.add(make_batches())
```