docs: Updates and refactor (#683)

This PR makes incremental changes to the documentation. * Closes #697 * Closes #698 - [x] Add dark mode - [x] Fix headers in navbar - [x] Add `extra.css` to customize navbar styles - [x] Customize fonts for prose/code blocks, navbar and admonitions - [x] Inspect all admonition boxes (remove redundant dropdowns) and improve clarity and readability - [x] Ensure that all images in the docs have white background (not transparent) to be viewable in dark mode - [x] Improve code formatting in code blocks to make them consistent with autoformatters (eslint/ruff) - [x] Add bolder weight to h1 headers - [x] Add diagram showing the difference between embedded (OSS) and serverless (Cloud) - [x] Fix [Creating an empty table](https://lancedb.github.io/lancedb/guides/tables/#creating-empty-table) section: right now, the subheaders are not clickable. - [x] In critical data ingestion methods like `table.add` (among others), the type signature often does not match the actual code - [x] Proof-read each documentation section and rewrite as necessary to provide more context, use cases, and explanations so it reads less like reference documentation. This is especially important for CRUD and search sections since those are so central to the user experience. - [x] The section for [Adding data](https://lancedb.github.io/lancedb/guides/tables/#adding-to-a-table) only shows examples for pandas and iterables. We should include pydantic models, arrow tables, etc. - [x] Add conceptual tutorial for IVF-PQ index - [x] Clearly separate vector search, FTS and filtering sections so that these are easier to find - [x] Add docs on refine factor to explain its importance for recall. Closes #716 - [x] Add an FAQ page showing answers to commonly asked questions about LanceDB. Closes #746 - [x] Add simple polars example to the integrations section. Closes #756 and closes #153 - [ ] Add basic docs for the Rust API (more detailed API docs can come later). Closes #781 - [x] Add a section on the various storage options on local vs. cloud (S3, EBS, EFS, local disk, etc.) and the tradeoffs involved. Closes #782 - [x] Revamp filtering docs: add pre-filtering examples and redo headers and update content for SQL filters. Closes #783 and closes #784. - [x] Add docs for data management: compaction, cleaning up old versions and incremental indexing. Closes #785 - [ ] Add a benchmark section that also discusses some best practices. Closes #787 --------- Co-authored-by: Ayush Chaurasia <ayush.chaurarsia@gmail.com> Co-authored-by: Will Jones <willjones127@gmail.com>
2026-01-06 03:42:57 +00:00 · 2024-01-18 13:48:37 -05:00
parent 33ab68c790
commit 4d5d748acd
58 changed files with 1386 additions and 763 deletions
--- a/docs/src/embeddings/embedding_functions.md
+++ b/docs/src/embeddings/embedding_functions.md
@@ -1,71 +1,70 @@
-Representing multi-modal data as vector embeddings is becoming a standard practice. Embedding functions themselves can be thought of as a part of the processing pipeline that each request(input) has to be passed through. After initial setup these components are not expected to change for a particular project. 
+Representing multi-modal data as vector embeddings is becoming a standard practice. Embedding functions can themselves be thought of as key part of the data processing pipeline that each request has to be passed through. The assumption here is: after initial setup, these components and the underlying methodology are not expected to change for a particular project.

-Our new embedding functions API allow you simply set it up once and the table remembers it, effectively making the **embedding functions disappear in the background** so you don't have to worry about modelling and can simply focus on the DB aspects of VectorDB.
+For this purpose, LanceDB introduces an **embedding functions API**, that allow you simply set up once, during the configuration stage of your project. After this, the table remembers it, effectively making the embedding functions *disappear in the background* so you don't have to worry about manually passing callables, and instead, simply focus on the rest of your data engineering pipeline.

-You can simply follow these steps and forget about the details of your embedding functions as long as you don't intend to change it.
+!!! warning
+    Using the implicit embeddings management approach means that you can forget about the manually passing around embedding
+    functions in your code, as long as you don't intend to change it at a later time. If your embedding function changes,
+    you'll have to re-configure your table with the new embedding function and regenerate the embeddings.

-### Step 1 - Define the embedding function
-We have some pre-defined embedding functions in the global registry with more coming soon. Here's let's an implementation of CLIP as example.
+## 1. Define the embedding function
+We have some pre-defined embedding functions in the global registry, with more coming soon. Here's let's an implementation of CLIP as example.
 ```
-from lancedb.embeddings import EmbeddingFunctionRegistry
-
 registry = EmbeddingFunctionRegistry.get_instance()
 clip = registry.get("open-clip").create()

 ```
-You can also define your own embedding function by implementing the `EmbeddingFunction` abstract base interface. It subclasses PyDantic Model which can be utilized to write complex schemas simply as we'll see next!
+You can also define your own embedding function by implementing the `EmbeddingFunction` abstract base interface. It subclasses Pydantic Model which can be utilized to write complex schemas simply as we'll see next!

-### Step 2 - Define the Data Model or Schema
-Our embedding function from the previous section abstracts away all the details about the models and dimensions required to define the schema. You can simply set a field as **source** or **vector** column. Here's how
+## 2. Define the data model or schema
+The embedding function defined above abstracts away all the details about the models and dimensions required to define the schema. You can simply set a field as **source** or **vector** column. Here's how:

 ```python
-from lancedb.pydantic import LanceModel, Vector
-
 class Pets(LanceModel):
    vector: Vector(clip.ndims) = clip.VectorField()
    image_uri: str = clip.SourceField()
-
 ```
-`VectorField` tells LanceDB to use the clip embedding function to generate query embeddings for `vector` column & `SourceField` tells that when adding data, automatically use the embedding function to encode `image_uri`.

+`VectorField` tells LanceDB to use the clip embedding function to generate query embeddings for the `vector` column and `SourceField` ensures that when adding data, we automatically use the specified embedding function to encode `image_uri`.

-### Step 3 - Create LanceDB Table
-Now that we have chosen/defined our embedding function and the schema, we can create the table
+## 3. Create LanceDB table
+Now that we have chosen/defined our embedding function and the schema, we can create the table:

 ```python
-import lancedb
-
 db = lancedb.connect("~/lancedb")
 table = db.create_table("pets", schema=Pets)
+
 ```

-That's it! We have ingested all the information needed to embed source and query inputs. We can now forget about the model and dimension details and start to build our VectorDB.
+That's it! We've provided all the information needed to embed the source and query inputs. We can now forget about the model and dimension details and start to build our VectorDB pipeline.

-### Step 4 - Ingest lots of data and run vector search!
-Now you can just add the data and it'll be vectorized automatically
+## 4. Ingest lots of data and query your table
+Any new or incoming data can just be added and it'll be vectorized automatically.

 ```python
 table.add([{"image_uri": u} for u in uris])
 ```

-Our OpenCLIP query embedding function support querying via both text and images. 
+Our OpenCLIP query embedding function supports querying via both text and images:

 ```python
 result = table.search("dog")
 ```

-Let's query an image
+Let's query an image:

 ```python
-from pathlib import Path
-
 p = Path("path/to/images/samoyed_100.jpg")
 query_image = Image.open(p)
 table.search(query_image)
 ```

-### Rate limit Handling
-`EmbeddingFunction` class wraps the calls for source and query embedding generation inside a rate limit handler that retries the requests with exponential backoff after successive failures. By default the maximum retires is set to 7. You can tune it by setting it to a different number or disable it by setting it to 0. Example:
+---
+
+## Rate limit Handling
+`EmbeddingFunction` class wraps the calls for source and query embedding generation inside a rate limit handler that retries the requests with exponential backoff after successive failures. By default, the maximum retires is set to 7. You can tune it by setting it to a different number, or disable it by setting it to 0.
+
+An example of how to do this is shown below:

 ```python
 clip = registry.get("open-clip").create() # Defaults to 7 max retries
@@ -73,16 +72,17 @@ clip = registry.get("open-clip").create(max_retries=10) # Increase max retries t
 clip = registry.get("open-clip").create(max_retries=0) # Retries disabled
 ```

-NOTE:
-Embedding functions can also fail due to other errors that have nothing to do with rate limits. This is why the errors are also logged.
+!!! note
+    Embedding functions can also fail due to other errors that have nothing to do with rate limits.
+    This is why the error is also logged.

-### A little fun with PyDantic
-LanceDB is integrated with PyDantic. In fact, we've used the integration in the above example to define the schema. It is also being used behind the scene by the embedding function API to ingest useful information as table metadata.
-You can also use it for adding utility operations in the schema. For example, in our multi-modal example, you can search images using text or another image. Let's define a utility function to plot the image.
+## Some fun with Pydantic
+
+LanceDB is integrated with Pydantic, which was used in the example above to define the schema in Python. It's also used behind the scenes by the embedding function API to ingest useful information as table metadata.
+
+You can also use the integration for adding utility operations in the schema. For example, in our multi-modal example, you can search images using text or another image. Let's define a utility function to plot the image.

 ```python
-from lancedb.pydantic import LanceModel, Vector
-
 class Pets(LanceModel):
    vector: Vector(clip.ndims) = clip.VectorField()
    image_uri: str = clip.SourceField()
@@ -91,8 +91,7 @@ class Pets(LanceModel):
    def image(self):
        return Image.open(self.image_uri)
 ```
-
-Now, you can covert your search results to PyDantic model and use its property.
+Now, you can covert your search results to a Pydantic model and use this property.

 ```python
 rs = table.search(query_image).limit(3).to_pydantic(Pets)
@@ -101,4 +100,4 @@ rs[2].image

 ![](../assets/dog_clip_output.png)

-Now that you have the basic idea about LanceDB embedding function, let us dive deeper into the API that you can use to implement your own embedding functions!
+Now that you have the basic idea about implicit management via embedding functions, let's dive deeper into a [custom API](./api.md) that you can use to implement your own embedding functions.