docs: add a section to describe scalar index (#1495)

2026-05-30 02:10:40 +00:00 · 2024-08-16 18:48:29 -07:00
parent 09ce6c5bb5
commit 5857cb4c6e
4 changed files with 125 additions and 11 deletions
--- a/docs/src/guides/scalar_index.md
+++ b/docs/src/guides/scalar_index.md
@@ -0,0 +1,108 @@
+# Building Scalar Index
+
+Similar to many SQL databases, LanceDB supports several types of Scalar indices to accelerate search
+over scalar columns.
+
+- `BTREE`: The most common type is BTREE. This index is inspired by the btree data structure
+  although only the first few layers of the btree are cached in memory.
+  It will perform well on columns with a large number of unique values and few rows per value.
+- `BITMAP`: this index stores a bitmap for each unique value in the column.
+  This index is useful for columns with a finite number of unique values and many rows per value.
+  For example, columns that represent "categories", "labels", or "tags"
+- `LABEL_LIST`: a special index that is used to index list columns whose values have a finite set of possibilities.
+  For example, a column that contains lists of tags (e.g. `["tag1", "tag2", "tag3"]`) can be indexed with a `LABEL_LIST` index.
+
+| Data Type                                                       | Filter                                    | Index Type   |
+| --------------------------------------------------------------- | ----------------------------------------- | ------------ |
+| Numeric, String, Temporal                                       | `<`, `=`, `>`, `in`, `between`, `is null` | `BTREE`      |
+| Boolean, numbers or strings with fewer than 1,000 unique values | `<`, `=`, `>`, `in`, `between`, `is null` | `BITMAP`     |
+| List of low cardinality of numbers or strings                   | `array_has_any`, `array_has_all`          | `LABEL_LIST` |
+
+=== "Python"
+
+    ```python
+    import lancedb
+    books = [
+      {"book_id": 1, "publisher": "plenty of books", "tags": ["fantasy", "adventure"]},
+      {"book_id": 2, "publisher": "book town", "tags": ["non-fiction"]},
+      {"book_id": 3, "publisher": "oreilly", "tags": ["textbook"]}
+    ]
+
+    db = lancedb.connect("./db")
+    table = db.create_table("books", books)
+    table.create_scalar_index("book_id")  # BTree by default
+    table.create_scalar_index("publisher", index_type="BITMAP")
+    ```
+
+=== "Typescript"
+
+    === "@lancedb/lancedb"
+
+        ```js
+        const db = await lancedb.connect("data");
+        const tbl = await db.openTable("my_vectors");
+
+        await tbl.create_index("book_id");
+        await tlb.create_index("publisher", { config: lancedb.Index.bitmap() })
+        ```
+
+For example, the following scan will be faster if the column `my_col` has a scalar index:
+
+=== "Python"
+
+    ```python
+    import lancedb
+
+    table = db.open_table("books")
+    my_df = table.search().where("book_id = 2").to_pandas()
+    ```
+
+=== "Typescript"
+
+    === "@lancedb/lancedb"
+
+        ```js
+        const db = await lancedb.connect("data");
+        const tbl = await db.openTable("books");
+
+        await tbl
+          .query()
+          .where("book_id = 2")
+          .limit(10)
+          .toArray();
+        ```
+
+Scalar indices can also speed up scans containing a vector search or full text search, and a prefilter:
+
+=== "Python"
+
+    ```python
+    import lancedb
+
+    data = [
+      {"book_id": 1, "vector": [1, 2]},
+      {"book_id": 2, "vector": [3, 4]},
+      {"book_id": 3, "vector": [5, 6]}
+    ]
+    table = db.create_table("book_with_embeddings", data)
+
+    (
+        table.search([1, 2])
+        .where("book_id != 3", prefilter=True)
+        .to_pandas()
+    )
+    ```
+
+=== "Typescript"
+
+    === "@lancedb/lancedb"
+
+        ```js
+        const db = await lancedb.connect("data/lance");
+        const tbl = await db.openTable("book_with_embeddings");
+
+        await tbl.search(Array(1536).fill(1.2))
+          .where("book_id != 3")  // prefilter is default behavior.
+          .limit(10)
+          .toArray();
+        ```