## LanceDB Tables
A Table is a collection of Records in a LanceDB Database.

![illustration](../assets/ecosystem-illustration.png)

In [2]:
!pip install lancedb -qq

In [3]:
import lancedb
db = lancedb.connect("./.lancedb")

LanceDB can ingest data from various sources: `dict`, `list[dict]`, `pd.DataFrame`, `pa.Table`, or an `Iterator[pa.RecordBatch]`. Let's take a look at some of these.

### From list of tuples or dictionaries

In [4]:
import lancedb

db = lancedb.connect("./.lancedb")

data = [{"vector": [1.1, 1.2], "lat": 45.5, "long": -122.7},
        {"vector": [0.2, 1.8], "lat": 40.1, "long": -74.1}]

db.create_table("my_table", data)

db["my_table"].head()

pyarrow.Table
vector: fixed_size_list<item: float>[2]
  child 0, item: float
lat: double
long: double
----
vector: [[[1.1,1.2],[0.2,1.8]]]
lat: [[45.5,40.1]]
long: [[-122.7,-74.1]]
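
The heading also mentions tuples, while the example above uses dictionaries. As a minimal sketch of one way to prepare tuple rows, assuming the column names are known up front, each tuple can be zipped with the names to produce the `list[dict]` form used above. The helper `tuples_to_dicts` is a hypothetical name for illustration and is not part of LanceDB.

```python
# Hypothetical helper (not a LanceDB API): turn tuple rows into list[dict].
def tuples_to_dicts(rows, columns):
    """Pair each tuple with the column names, checking the row width."""
    records = []
    for row in rows:
        if len(row) != len(columns):
            raise ValueError(f"expected {len(columns)} values, got {len(row)}")
        records.append(dict(zip(columns, row)))
    return records


# Same values as the dictionary example above, written as tuples.
rows = [([1.1, 1.2], 45.5, -122.7), ([0.2, 1.8], 40.1, -74.1)]
records = tuples_to_dicts(rows, ["vector", "lat", "long"])
print(records[0])
```

The result has the same shape as `data` above, so it could be passed to `db.create_table(...)` in the same way.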

### From pandas DataFrame


In [5]:
import pandas as pd

data = pd.DataFrame(
    {
        "vector": [[1.1, 1.2, 1.3, 1.4], [0.2, 1.8, 0.4, 3.6]],
        "lat": [45.5, 40.1],
        "long": [-122.7, -74.1],
    }
)
db.create_table("my_table_pandas", data)
db["my_table_pandas"].head()

pyarrow.Table
vector: fixed_size_list<item: float>[4]
  child 0, item: float
lat: double
long: double
----
vector: [[[1.1,1.2,1.3,1.4],[0.2,1.8,0.4,3.6]]]
lat: [[45.5,40.1]]
long: [[-122.7,-74.1]]

Data is converted to Arrow before being written to disk. For maximum control over how data is saved, either provide a PyArrow schema to convert to, or provide a PyArrow Table directly.

In [6]:
import pyarrow as pa

custom_schema = pa.schema([
    pa.field("vector", pa.list_(pa.float32(), 4)),
    pa.field("lat", pa.float32()),
    pa.field("long", pa.float32()),
])

table = db.create_table("table3", data, schema=custom_schema, mode="overwrite")
table.schema


vector: fixed_size_list<item: float>[4]
  child 0, item: float
lat: float
long: float

### From an Arrow Table

You can also create LanceDB tables directly from PyArrow tables. LanceDB supports the `float16` data type.

In [7]:
import numpy as np

dim = 16
total = 2
schema = pa.schema(
    [
        pa.field("vector", pa.list_(pa.float16(), dim)),
        pa.field("text", pa.string())
    ]
)
data = pa.Table.from_arrays(
    [
        pa.array([np.random.randn(dim).astype(np.float16) for _ in range(total)],
                pa.list_(pa.float16(), dim)),
        pa.array(["foo", "bar"])
    ],
    ["vector", "text"],
)

tbl = db.create_table("f16_tbl", data, schema=schema)
tbl.schema

vector: fixed_size_list<item: halffloat>[16]
  child 0, item: halffloat
text: string

### From Pydantic Models

LanceDB can create an Apache Arrow schema from a Pydantic `BaseModel`.

In [8]:
from lancedb.pydantic import Vector, LanceModel

class Content(LanceModel):
    movie_id: int
    vector: Vector(128)
    genres: str
    title: str
    imdb_id: int
        
    @property
    def imdb_url(self) -> str:
        return f"https://www.imdb.com/title/tt{self.imdb_id}"

import pyarrow as pa
db = lancedb.connect("~/.lancedb")
table_name = "movielens_small"
table = db.create_table(table_name, schema=Content)
table.schema

movie_id: int64 not null
vector: fixed_size_list<item: float>[128] not null
  child 0, item: float
genres: string not null
title: string not null
imdb_id: int64 not null

### Using Iterators / Writing Large Datasets

It is recommended to use iterators to add large datasets in batches when creating your table in one go. Unlike manually adding batches with `table.add()`, this does not create multiple versions of your dataset.

LanceDB also supports PyArrow `RecordBatch` iterators, as well as other generators that produce supported data types.

Here's an example that uses a `RecordBatch` iterator to create a table.

In [9]:
import pyarrow as pa

def make_batches():
    for i in range(5):
        yield pa.RecordBatch.from_arrays(
            [
                pa.array([[3.1, 4.1], [5.9, 26.5]],
                        pa.list_(pa.float32(), 2)),
                pa.array(["foo", "bar"]),
                pa.array([10.0, 20.0]),
            ],
            ["vector", "item", "price"],
        )

schema = pa.schema([
    pa.field("vector", pa.list_(pa.float32(), 2)),
    pa.field("item", pa.utf8()),
    pa.field("price", pa.float32()),
])

db.create_table("table4", make_batches(), schema=schema)

LanceTable(table4)
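
When the source is an ordinary Python iterable of records rather than Arrow data, the records can be grouped into fixed-size batches before ingestion. Below is a minimal sketch using only the standard library. The helper `iter_batches`, the sample rows, and the batch size are assumptions for illustration. Each yielded batch is a `list[dict]`, one of the supported data types listed earlier.

```python
from itertools import islice


def iter_batches(records, batch_size):
    """Yield successive lists of at most `batch_size` records."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    it = iter(records)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


# Illustrative data: a lazy stream of 5 rows that is consumed batch by batch.
rows = (
    {"vector": [float(i), float(i) + 1.0], "item": f"item-{i}", "price": 10.0 * i}
    for i in range(5)
)

batch_sizes = [len(batch) for batch in iter_batches(rows, 2)]
print(batch_sizes)  # [2, 2, 1]
```

A generator such as `iter_batches(rows, 1000)` could then stand in for `make_batches()` in a `create_table` call like the one above, provided the records match the schema.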

### Using pandas `DataFrame` Iterator and Pydantic Schema

You can set the schema with either a PyArrow schema object or a Pydantic model.

In [10]:
import pyarrow as pa
import pandas as pd

class PydanticSchema(LanceModel):
    vector: Vector(2)
    item: str
    price: float

def make_batches():
    for i in range(5):
        yield pd.DataFrame(
                {
                    "vector": [[3.1, 4.1], [1, 1]],
                    "item": ["foo", "bar"],
                    "price": [10.0, 20.0],
                })

tbl = db.create_table("table5", make_batches(), schema=PydanticSchema)
tbl.schema

vector: fixed_size_list<item: float>[2] not null
  child 0, item: float
item: string not null
price: double not null

## Creating Empty Table

You can create an empty table by passing only the schema, and add data to it later using `table.add()`.

In [11]:
import lancedb
from lancedb.pydantic import LanceModel, Vector

class Model(LanceModel):
    vector: Vector(2)

tbl = db.create_table("table6", schema=Model.to_arrow_schema())

## Open Existing Tables

If you forget the name of your table, you can always get a listing of all table names:


In [12]:
db.table_names()

['table6', 'table4', 'table5', 'movielens_small']

In [13]:
tbl = db.open_table("table4")
tbl.to_pandas()

| | vector | item | price |
|---|---|---|---|
| 0 | [3.1, 4.1] | foo | 10.0 |
| 1 | [5.9, 26.5] | bar | 20.0 |
| 2 | [3.1, 4.1] | foo | 10.0 |
| 3 | [5.9, 26.5] | bar | 20.0 |
| 4 | [3.1, 4.1] | foo | 10.0 |
| 5 | [5.9, 26.5] | bar | 20.0 |
| 6 | [3.1, 4.1] | foo | 10.0 |
| 7 | [5.9, 26.5] | bar | 20.0 |
| 8 | [3.1, 4.1] | foo | 10.0 |
| 9 | [5.9, 26.5] | bar | 20.0 |


## Adding to a Table
After a table has been created, you can always add more data to it using `table.add()`.

You can add any of the data structures accepted by a LanceDB table, i.e., `dict`, `list[dict]`, `pd.DataFrame`, or an `Iterator[pa.RecordBatch]`. Here are some examples.

In [14]:
data = [
        {"vector": [1.3, 1.4], "item": "fizz", "price": 100.0},
        {"vector": [9.5, 56.2], "item": "buzz", "price": 200.0}
]
tbl.add(data)

You can also add a large dataset in one go by passing an iterator of any supported data type.

### Adding via Iterator

Here, we'll use a generator that yields lists of dictionaries.

In [15]:
def make_batches():
    for i in range(5):
        yield [
            {"vector": [3.1, 4.1], "item": "foo", "price": 10.0},
            {"vector": [1, 1], "item": "bar", "price": 20.0},
        ]

tbl.add(make_batches())

## Deleting from a Table

Use the `delete()` method to remove rows from a table. To choose which rows to delete, provide a filter that matches on the metadata columns. Every row that matches the filter is deleted, for example:


```python
tbl.delete('item = "fizz"')
```


In [16]:
print(len(tbl))
      
tbl.delete("price = 20.0")
      
len(tbl)

22


12
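
The counts above follow from the rows added so far: 10 rows from the `table4` batches, 2 from the first `add()`, and 10 more from the iterator, of which the 10 `bar` rows have `price = 20.0`. Here is a plain-Python sketch of the same arithmetic. It mirrors the data from the earlier cells (vectors omitted) rather than querying LanceDB.

```python
# Rebuild the rows of `tbl` as plain dicts (an illustration, not a LanceDB call).
initial = [{"item": "foo", "price": 10.0}, {"item": "bar", "price": 20.0}] * 5
first_add = [{"item": "fizz", "price": 100.0}, {"item": "buzz", "price": 200.0}]
iterator_add = [{"item": "foo", "price": 10.0}, {"item": "bar", "price": 20.0}] * 5

rows = initial + first_add + iterator_add
print(len(rows))  # 22

# Counterpart of tbl.delete("price = 20.0"): keep the rows that do NOT match.
remaining = [r for r in rows if r["price"] != 20.0]
print(len(remaining))  # 12
```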

### Delete from a list of values

In [17]:
to_remove = ["foo", "buzz"]
# Quote each value so it is parsed as a string literal, not a column name
to_remove = ", ".join(f"'{v}'" for v in to_remove)
print(tbl.to_pandas())
tbl.delete(f"item IN ({to_remove})")


         vector  item  price
0    [3.1, 4.1]   foo   10.0
1    [3.1, 4.1]   foo   10.0
2    [3.1, 4.1]   foo   10.0
3    [3.1, 4.1]   foo   10.0
4    [3.1, 4.1]   foo   10.0
5    [1.3, 1.4]  fizz  100.0
6   [9.5, 56.2]  buzz  200.0
7    [3.1, 4.1]   foo   10.0
8    [3.1, 4.1]   foo   10.0
9    [3.1, 4.1]   foo   10.0
10   [3.1, 4.1]   foo   10.0
11   [3.1, 4.1]   foo   10.0
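
Values in a SQL `IN` list must be written as literals: strings need single quotes, otherwise a bare `foo` is parsed as a column name. Below is a minimal sketch of a helper that renders a Python list as such a list. The names `sql_literal` and `sql_in_list` are hypothetical, and doubling embedded single quotes is the standard SQL escaping convention, assumed here to be accepted by LanceDB's filter parser.

```python
def sql_literal(value):
    """Render a Python value as a SQL literal (strings and numbers only)."""
    if isinstance(value, bool):
        raise TypeError("booleans are not handled in this sketch")
    if isinstance(value, str):
        # Standard SQL escaping: double any embedded single quote.
        return "'" + value.replace("'", "''") + "'"
    if isinstance(value, (int, float)):
        return repr(value)
    raise TypeError(f"unsupported value type: {type(value).__name__}")


def sql_in_list(values):
    """Join literals for use inside `column IN (...)`."""
    if not values:
        raise ValueError("IN list must not be empty")
    return ", ".join(sql_literal(v) for v in values)


predicate = f"item IN ({sql_in_list(['foo', 'buzz'])})"
print(predicate)  # item IN ('foo', 'buzz')
```

The resulting string could then be passed to `tbl.delete(predicate)`. Numeric values are left unquoted, so the same helper also covers a list of prices.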

In [None]:
df = pd.DataFrame(
    {
        "vector": [[3.1, 4.1], [1, 1]],
        "item": ["foo", "bar"],
        "price": [10.0, 20.0],
    }
)

tbl = db.create_table("table7", data=df, mode="overwrite")

In [None]:
to_remove = [10.0, 20.0]
to_remove = ", ".join(str(v) for v in to_remove)

tbl.delete(f"price IN ({to_remove})")

In [None]:
tbl.to_pandas()

| | vector | item | price |
|---|---|---|---|
