docs: Updates and refactor (#683)

This PR makes incremental changes to the documentation.

* Closes #697
* Closes #698

## Chores

- [x] Add dark mode
- [x] Fix headers in navbar
- [x] Add `extra.css` to customize navbar styles
- [x] Customize fonts for prose/code blocks, navbar and admonitions
- [x] Inspect all admonition boxes (remove redundant dropdowns) and improve clarity and readability
- [x] Ensure that all images in the docs have white background (not transparent) to be viewable in dark mode
- [x] Improve code formatting in code blocks to make them consistent with autoformatters (eslint/ruff)
- [x] Add bolder weight to h1 headers
- [x] Add diagram showing the difference between embedded (OSS) and serverless (Cloud)
- [x] Fix [Creating an empty table](https://lancedb.github.io/lancedb/guides/tables/#creating-empty-table) section: right now, the subheaders are not clickable.
- [x] In critical data ingestion methods like `table.add` (among others), the type signature often does not match the actual code
- [x] Proof-read each documentation section and rewrite as necessary to provide more context, use cases, and explanations so it reads less like reference documentation. This is especially important for CRUD and search sections since those are so central to the user experience.

## Restructure/new content

- [x] The section for [Adding data](https://lancedb.github.io/lancedb/guides/tables/#adding-to-a-table) only shows examples for pandas and iterables. We should include pydantic models, arrow tables, etc.
- [x] Add conceptual tutorial for IVF-PQ index
- [x] Clearly separate vector search, FTS and filtering sections so that these are easier to find
- [x] Add docs on refine factor to explain its importance for recall. Closes #716
- [x] Add an FAQ page showing answers to commonly asked questions about LanceDB. Closes #746
- [x] Add simple polars example to the integrations section. Closes #756 and closes #153
- [ ] Add basic docs for the Rust API (more detailed API docs can come later). Closes #781
- [x] Add a section on the various storage options on local vs. cloud (S3, EBS, EFS, local disk, etc.) and the tradeoffs involved. Closes #782
- [x] Revamp filtering docs: add pre-filtering examples and redo headers and update content for SQL filters. Closes #783 and closes #784.
- [x] Add docs for data management: compaction, cleaning up old versions and incremental indexing. Closes #785
- [ ] Add a benchmark section that also discusses some best practices. Closes #787

---------

Co-authored-by: Ayush Chaurasia <ayush.chaurarsia@gmail.com>
Co-authored-by: Will Jones <willjones127@gmail.com>
2026-05-31 19:00:39 +00:00 · 2024-01-18 13:48:37 -05:00
parent 8bcdc81fd3
commit 119b928a52
59 changed files with 1406 additions and 770 deletions
--- a/docs/src/guides/tables.md
+++ b/docs/src/guides/tables.md
@@ -1,12 +1,15 @@
- <a href="https://colab.research.google.com/github/lancedb/lancedb/blob/main/docs/src/notebooks/tables_guide.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"></a><br/>
+
+<a href="https://colab.research.google.com/github/lancedb/lancedb/blob/main/docs/src/notebooks/tables_guide.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"></a><br/>
+
 A Table is a collection of Records in a LanceDB Database. Tables in Lance have a schema that defines the columns and their types. These schemas can include nested columns and can evolve over time.

-This guide will show how to create tables, insert data into them, and update the data. You can follow along on colab!
+This guide will show how to create tables, insert data into them, and update the data.  
+

 ## Creating a LanceDB Table

 === "Python"
-    ### LanceDB Connection
+    Initialize a LanceDB connection and create a table using one of the many methods listed below.

    ```python
    import lancedb
@@ -48,7 +51,7 @@ This guide will show how to create tables, insert data into them, and update the
        db.create_table("name", data, mode="overwrite")
        ```

-    ### From pandas DataFrame
+    ### From a Pandas DataFrame

    ```python
    import pandas as pd
@@ -59,9 +62,9 @@ This guide will show how to create tables, insert data into them, and update the
        "long": [-122.7, -74.1]
    })

-    db.create_table("table2", data)
+    db.create_table("my_table", data)

-    db["table2"].head()
+    db["my_table"].head()
    ```
    !!! info "Note"
        Data is converted to Arrow before being written to disk. For maximum control over how data is saved, either provide the PyArrow schema to convert to or else provide a PyArrow Table directly.
@@ -73,11 +76,11 @@ This guide will show how to create tables, insert data into them, and update the
    pa.field("long", pa.float32())
    ])

-    table = db.create_table("table3", data, schema=custom_schema)
+    table = db.create_table("my_table", data, schema=custom_schema)
    ```

    ### From PyArrow Tables
-    You can also create LanceDB tables directly from pyarrow tables
+    You can also create LanceDB tables directly from PyArrow tables.

    ```python
    table = pa.Table.from_arrays(
@@ -92,18 +95,18 @@ This guide will show how to create tables, insert data into them, and update the

    db = lancedb.connect("db")

-    tbl = db.create_table("test1", table)
+    tbl = db.create_table("my_table", table)
    ```

    ### From Pydantic Models
    When you create an empty table without data, you must specify the table schema.
-    LanceDB supports creating tables by specifying a pyarrow schema or a specialized
-    pydantic model called `LanceModel`.
+    LanceDB supports creating tables by specifying a PyArrow schema or a specialized
+    Pydantic model called `LanceModel`.

    For example, the following Content model specifies a table with 5 columns:
-    movie_id, vector, genres, title, and imdb_id. When you create a table, you can
+    `movie_id`, `vector`, `genres`, `title`, and `imdb_id`. When you create a table, you can
    pass the class as the value of the `schema` parameter to `create_table`.
-    The `vector` column is a `Vector` type, which is a specialized pydantic type that
+    The `vector` column is a `Vector` type, which is a specialized Pydantic type that
    can be configured with the vector dimensions. It is also important to note that
    LanceDB only understands subclasses of `lancedb.pydantic.LanceModel`
    (which itself derives from `pydantic.BaseModel`).
@@ -167,8 +170,8 @@ This guide will show how to create tables, insert data into them, and update the

    #### Validators

-    Note that neither pydantic nor pyarrow automatically validates that input data
-    is of the *correct* timezone, but this is easy to add as a custom field validator:
+    Note that neither Pydantic nor PyArrow automatically validates that input data
+    is of the correct timezone, but this is easy to add as a custom field validator:

    ```python
    from datetime import datetime
@@ -208,9 +211,9 @@ This guide will show how to create tables, insert data into them, and update the

    ### Using Iterators / Writing Large Datasets

-    It is recommended to use itertators to add large datasets in batches when creating your table in one go. This does not create multiple versions of your dataset unlike manually adding batches using `table.add()`
+    It is recommended to use iterators to add large datasets in batches when creating your table in one go. Unlike manually adding batches with `table.add()`, this does not create multiple versions of your dataset.

-    LanceDB additionally supports pyarrow's `RecordBatch` Iterators or other generators producing supported data types.
+    LanceDB additionally supports PyArrow's `RecordBatch` Iterators or other generators producing supported data types.

    Here's an example using a `RecordBatch` iterator to create a table.

@@ -235,47 +238,13 @@ This guide will show how to create tables, insert data into them, and update the
        pa.field("price", pa.float32()),
    ])

-    db.create_table("table4", make_batches(), schema=schema)
+    db.create_table("batched_table", make_batches(), schema=schema)
    ```

-    You can also use iterators of other types like Pandas dataframe or Pylists directly in the above example.
+    You can also use iterators of other types, such as Pandas DataFrames or Python lists, directly in the above example.

-    ## Creating Empty Table
-    You can create empty tables in python. Initialize it with schema and later ingest data into it.
-
-    ```python
-    import lancedb
-    import pyarrow as pa
-
-    schema = pa.schema(
-      [
-          pa.field("vector", pa.list_(pa.float32(), 2)),
-          pa.field("item", pa.string()),
-          pa.field("price", pa.float32()),
-      ])
-    tbl = db.create_table("table5", schema=schema)
-    data = [
-        {"vector": [3.1, 4.1], "item": "foo", "price": 10.0},
-        {"vector": [5.9, 26.5], "item": "bar", "price": 20.0},
-    ]
-    tbl.add(data=data)
-    ```
-
-    You can also use Pydantic to specify the schema
-
-    ```python
-    import lancedb
-    from lancedb.pydantic import LanceModel, vector
-
-    class Model(LanceModel):
-          vector: Vector(2)
-
-    tbl = db.create_table("table5", schema=Model.to_arrow_schema())
-    ```
-
-=== "Javascript/Typescript"
-
-    ### VectorDB Connection
+=== "JavaScript"
+    Initialize a VectorDB connection and create a table using one of the many methods listed below.

    ```javascript
    const lancedb = require("vectordb");
@@ -284,15 +253,18 @@ This guide will show how to create tables, insert data into them, and update the
    const db = await lancedb.connect(uri);
    ```

-    ### Creating a Table
-
-    You can create a LanceDB table in javascript using an array of records.
+    You can create a LanceDB table in JavaScript using an array of JSON records as follows.

    ```javascript
-    data
-    const tb = await db.createTable("my_table",
-      [{"vector": [3.1, 4.1], "item": "foo", "price": 10.0},
-       {"vector": [5.9, 26.5], "item": "bar", "price": 20.0}])
+    const tb = await db.createTable("my_table", [{
+        "vector": [3.1, 4.1],
+        "item": "foo",
+        "price": 10.0
+    }, {
+        "vector": [5.9, 26.5],
+        "item": "bar",
+        "price": 20.0
+    }]);
    ```

    !!! info "Note"
@@ -304,81 +276,146 @@ This guide will show how to create tables, insert data into them, and update the

 ## Open existing tables

-If you forget the name of your table, you can always get a listing of all table names:
-
-
 === "Python"
-    ### Get a list of existing Tables
+    If you forget the name of your table, you can always get a listing of all table names.

    ```python
    print(db.table_names())
    ```
-=== "Javascript/Typescript"
+
+    Then, you can open any existing tables.
+
+    ```python
+    tbl = db.open_table("my_table")
+    ```
+
+=== "JavaScript"
+    If you forget the name of your table, you can always get a listing of all table names.

    ```javascript
    console.log(await db.tableNames());
    ```

-Then, you can open any existing tables
-
-=== "Python"
-
-    ```python
-    tbl = db.open_table("my_table")
-    ```
-=== "Javascript/Typescript"
+    Then, you can open any existing tables.

    ```javascript
    const tbl = await db.openTable("my_table");
    ```

-## Adding to a Table
-After a table has been created, you can always add more data to it using
+## Creating empty table

 === "Python"
-    You can add any of the valid data structures accepted by LanceDB table, i.e, `dict`, `list[dict]`, `pd.DataFrame`, or a `Iterator[pa.RecordBatch]`. Here are some examples.
+    In Python, you can create an empty table for scenarios where you want to add data to the table later. An example would be when you want to collect data from a stream/external file and then add it to a table in batches.

-    ### Adding Pandas DataFrame
+    An empty table can be initialized via a PyArrow schema.
+
+    ```python
+    import lancedb
+    import pyarrow as pa
+
+    schema = pa.schema(
+      [
+          pa.field("vector", pa.list_(pa.float32(), 2)),
+          pa.field("item", pa.string()),
+          pa.field("price", pa.float32()),
+      ])
+    tbl = db.create_table("empty_table_add", schema=schema)
+    ```
+
+    Alternatively, you can use Pydantic to specify the schema for the empty table. Note that we do not
+    directly import `pydantic` but instead use `LanceModel` from `lancedb.pydantic`, a subclass of
+    `pydantic.BaseModel` that has been extended to support LanceDB-specific types like `Vector`.
+
+    ```python
+    import lancedb
+    from lancedb.pydantic import LanceModel, Vector
+
+    class Item(LanceModel):
+        vector: Vector(2)
+        item: str
+        price: float
+
+    tbl = db.create_table("empty_table_add", schema=Item.to_arrow_schema())
+    ```
+
+    Once the empty table has been created, you can add data to it via the various methods listed in the [Adding to a table](#adding-to-a-table) section.
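
For the streaming scenario described above, one way to group incoming records into batches before adding them is sketched below. `batched_records` and the sample rows are hypothetical names introduced for illustration; they are not part of the LanceDB API, and the rows simply follow the `vector`/`item`/`price` schema used in this guide.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List


def batched_records(records: Iterable[Dict], batch_size: int) -> Iterator[List[Dict]]:
    """Yield successive lists of at most `batch_size` records."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    it = iter(records)
    while True:
        # Pull the next chunk lazily so the full stream is never held in memory.
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


# Hypothetical sample rows standing in for data read from a stream or file.
rows = (
    {"vector": [float(i), float(i) + 1.0], "item": f"item_{i}", "price": 10.0 * i}
    for i in range(5)
)

# Materialized here only to inspect the result; in practice the generator
# would be consumed lazily.
batches = list(batched_records(rows, batch_size=2))
print([len(batch) for batch in batches])
```

Each yielded list is a `list[dict]`, so it could be passed to `tbl.add(...)` one batch at a time, or the generator itself could be handed over in the same way as the iterator examples in this guide.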
+
+## Adding to a table
+
+After a table has been created, you can always add more data to it using the various methods available.
+
+=== "Python"
+    You can add any of the valid data structures accepted by a LanceDB table, i.e., `dict`, `list[dict]`, `pd.DataFrame`, or `Iterator[pa.RecordBatch]`. Below are some examples.
+
+    ### Add a Pandas DataFrame

    ```python
    df = pd.DataFrame({
-        "vector": [[1.3, 1.4], [9.5, 56.2]], "item": ["fizz", "buzz"], "price": [100.0, 200.0]
+        "vector": [[1.3, 1.4], [9.5, 56.2]], "item": ["banana", "apple"], "price": [5.0, 7.0]
    })
    tbl.add(df)
    ```

-    You can also add a large dataset batch in one go using Iterator of any supported data types.
+    ### Add an Iterator

-    ### Adding to table using Iterator
+    You can also add a large dataset in batches in one go using an iterator of any supported data type.

    ```python
    def make_batches():
        for i in range(5):
            yield [
-                    {"vector": [3.1, 4.1], "item": "foo", "price": 10.0},
-                    {"vector": [5.9, 26.5], "item": "bar", "price": 20.0}
+                    {"vector": [3.1, 4.1], "item": "peach", "price": 6.0},
+                    {"vector": [5.9, 26.5], "item": "pear", "price": 5.0}
                ]
    tbl.add(make_batches())
    ```

-    The other arguments accepted:
+    ### Add a PyArrow table

-    | Name | Type | Description | Default |
-    |---|---|---|---|
-    | data | DATA | The data to insert into the table. | required |
-    | mode | str | The mode to use when writing the data. Valid values are "append" and "overwrite". | append |
-    | on_bad_vectors | str | What to do if any of the vectors are not the same size or contains NaNs. One of "error", "drop", "fill". | drop |
-    | fill value | float | The value to use when filling vectors: Only used if on_bad_vectors="fill". | 0.0 |
+    If you have data coming in as a PyArrow table, you can add it directly to the LanceDB table.

+    ```python
+    pa_table = pa.Table.from_arrays(
+            [
+                pa.array([[9.1, 6.7], [9.9, 31.2]],
+                        pa.list_(pa.float32(), 2)),
+                pa.array(["mango", "orange"]),
+                pa.array([7.0, 4.0]),
+            ],
+            ["vector", "item", "price"],
+        )

-=== "Javascript/Typescript"
-
-    ```javascript
-    await tbl.add([{vector: [1.3, 1.4], item: "fizz", price: 100.0},
-        {vector: [9.5, 56.2], item: "buzz", price: 200.0}])
+    tbl.add(pa_table)
    ```

-## Deleting from a Table
+    ### Add a Pydantic Model
+
+    Assuming that a table has been created with the correct schema as shown [above](#creating-empty-table), you can add data items that are valid Pydantic models to the table.
+
+    ```python
+    pydantic_model_items = [
+        Item(vector=[8.1, 4.7], item="pineapple", price=10.0),
+        Item(vector=[6.9, 9.3], item="avocado", price=9.0)
+    ]
+
+    tbl.add(pydantic_model_items)
+    ```
+
+
+=== "JavaScript"
+
+    ```javascript
+    await tbl.add(
+        [
+            {vector: [1.3, 1.4], item: "fizz", price: 100.0},
+            {vector: [9.5, 56.2], item: "buzz", price: 200.0}
+        ]
+    )
+    ```
+
+## Deleting from a table

 Use the `delete()` method on tables to delete rows from a table. To choose which rows to delete, provide a filter that matches on the metadata columns. This can delete any number of rows that match the filter.

@@ -423,7 +460,7 @@ Use the `delete()` method on tables to delete rows from a table. To choose which
    # 0  3  [5.0, 6.0]
    ```

-=== "Javascript/Typescript"
+=== "JavaScript"

    ```javascript
    await tbl.delete('item = "fizz"')
@@ -451,7 +488,7 @@ Use the `delete()` method on tables to delete rows from a table. To choose which
    await tbl.countRows() // Returns 1
    ```
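
When many rows need to be deleted at once, the filter string can be built programmatically. The sketch below is a minimal illustration, assuming the filter grammar accepts standard SQL `IN` lists with single-quoted string literals; `in_filter` is a hypothetical helper, not part of the LanceDB API.

```python
from typing import Iterable


def in_filter(column: str, values: Iterable[str]) -> str:
    """Build a SQL `IN` predicate for string values, e.g. item IN ('a', 'b')."""
    quoted = []
    for value in values:
        # Escape embedded single quotes by doubling them, as in standard SQL.
        quoted.append("'" + value.replace("'", "''") + "'")
    if not quoted:
        raise ValueError("at least one value is required")
    return f"{column} IN ({', '.join(quoted)})"


predicate = in_filter("item", ["fizz", "buzz"])
print(predicate)  # item IN ('fizz', 'buzz')
```

The resulting string could then be passed to the `delete()` method, for example `tbl.delete(predicate)` in Python.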

-## Updating a Table
+## Updating a table

 This can be used to update zero to all rows depending on how many rows match the where clause. The update queries follow the form of a SQL UPDATE statement. The `where` parameter is a SQL filter that matches on the metadata columns. The `values` or `values_sql` parameters are used to provide the new values for the columns.

@@ -463,7 +500,7 @@ This can be used to update zero to all rows depending on how many rows match the

 !!! info "SQL syntax"

-    See [SQL filters](sql.md) for more information on the supported SQL syntax.
+    See [SQL filters](../sql.md) for more information on the supported SQL syntax.

 !!! warning "Warning"

@@ -502,9 +539,9 @@ This can be used to update zero to all rows depending on how many rows match the
    2  2  [10.0, 10.0]
    ```

-=== "Javascript/Typescript"
+=== "JavaScript"

-    API Reference: [vectordb.Table.update](../../javascript/interfaces/Table/#update)
+    API Reference: [vectordb.Table.update](../javascript/interfaces/Table.md/#update)

    ```javascript
    const lancedb = require("vectordb");
@@ -540,7 +577,7 @@ The `values` parameter is used to provide the new values for the columns as lite
    2  3  [10.0, 10.0]
    ```

-=== "Javascript/Typescript"
+=== "JavaScript"

    ```javascript
    await tbl.update({ valuesSql: { x: "x + 1" } })
@@ -550,7 +587,6 @@ The `values` parameter is used to provide the new values for the columns as lite

    When rows are updated, they are moved out of the index. The row will still show up in ANN queries, but the query will not be as fast as it would be if the row was in the index. If you update a large proportion of rows, consider rebuilding the index afterwards.

+## What's next?

-## What's Next?
-
-Learn how to Query your tables and create indices
+Learn the best practices on creating an ANN index and getting the most out of it.