docs: add fine tuning section in retriever guide and minor fixes (#1438)

2026-06-04 21:00:44 +00:00 · 2024-07-11 17:34:29 +05:30
parent fdc949bafb
commit bb2e624ff0
7 changed files with 1531 additions and 3 deletions
--- a/docs/mkdocs.yml
+++ b/docs/mkdocs.yml
@@ -105,6 +105,7 @@ nav:
              - Jina Reranker: reranking/jina.md
              - OpenAI Reranker: reranking/openai.md
              - Building Custom Rerankers: reranking/custom_reranker.md
+              - Example: notebooks/lancedb_reranking.ipynb
          - Filtering: sql.md
          - Versioning & Reproducibility: notebooks/reproducibility.ipynb
          - Configuring Storage: guides/storage.md
@@ -112,6 +113,7 @@ nav:
          - Tuning retrieval performance:
              - Choosing right query type: guides/tuning_retrievers/1_query_types.md
              - Reranking: guides/tuning_retrievers/2_reranking.md
+              - Embedding fine-tuning: guides/tuning_retrievers/3_embed_tuning.md
      - 🧬 Managing embeddings:
          - Overview: embeddings/index.md
          - Embedding functions: embeddings/embedding_functions.md
@@ -188,6 +190,7 @@ nav:
          - Jina Reranker: reranking/jina.md
          - OpenAI Reranker: reranking/openai.md
          - Building Custom Rerankers: reranking/custom_reranker.md
+          - Example: notebooks/lancedb_reranking.ipynb
      - Filtering: sql.md
      - Versioning & Reproducibility: notebooks/reproducibility.ipynb
      - Configuring Storage: guides/storage.md
@@ -195,6 +198,7 @@ nav:
      - Tuning retrieval performance:
          - Choosing right query type: guides/tuning_retrievers/1_query_types.md
          - Reranking: guides/tuning_retrievers/2_reranking.md
+          - Embedding fine-tuning: guides/tuning_retrievers/3_embed_tuning.md
  - Managing Embeddings:
      - Overview: embeddings/index.md
      - Embedding functions: embeddings/embedding_functions.md
--- a/docs/src/embeddings/default_embedding_functions.md
+++ b/docs/src/embeddings/default_embedding_functions.md
@@ -563,7 +563,7 @@ uris = [
 # get each uri as bytes
 image_bytes = [requests.get(uri).content for uri in uris]
 table.add(
-    [{"label": labels, "image_uri": uris, "image_bytes": image_bytes}]
+    pd.DataFrame({"label": labels, "image_uri": uris, "image_bytes": image_bytes})
 )
 ```
 Now we can search using text from both the default vector column and the custom vector column
--- a/docs/src/guides/tuning_retrievers/1_query_types.md
+++ b/docs/src/guides/tuning_retrievers/1_query_types.md
@@ -1,4 +1,7 @@
 ## Improving retriever performance
+
+Try it yourself - <a href="https://colab.research.google.com/github/lancedb/lancedb/blob/main/docs/src/notebooks/lancedb_reranking.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"></a><br/>
+
 VectorDBs are used as retreivers in recommender or chatbot-based systems for retrieving relevant data based on user queries. For example, retriever is a critical component of Retrieval Augmented Generation (RAG) acrhitectures. In this section, we will discuss how to improve the performance of retrievers.

 There are serveral ways to improve the performance of retrievers. Some of the common techniques are:
--- a/docs/src/guides/tuning_retrievers/2_reranking.md
+++ b/docs/src/guides/tuning_retrievers/2_reranking.md
@@ -1,4 +1,6 @@
-Continuing from the previous example, we can now rerank the results using more complex rerankers.
+Continuing from the previous section, we can now rerank the results using more complex rerankers.
+
+Try it yourself - <a href="https://colab.research.google.com/github/lancedb/lancedb/blob/main/docs/src/notebooks/lancedb_reranking.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"></a><br/>

 ## Reranking search results
 You can rerank any search results using a reranker. The syntax for reranking is as follows:
--- a/docs/src/guides/tuning_retrievers/3_embed_tuning.md
+++ b/docs/src/guides/tuning_retrievers/3_embed_tuning.md
@@ -0,0 +1,82 @@
+## Fine-tuning the Embedding Model
+Try it yourself - <a href="https://colab.research.google.com/github/lancedb/lancedb/blob/main/docs/src/notebooks/embedding_tuner.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"></a><br/>
+
+Another way to improve retriever performance is to fine-tune the embedding model itself. Fine-tuning helps the model learn better representations of the documents and queries in your dataset. This is particularly useful when the dataset is very different from the data the embedding model was pre-trained on.
+
+We'll use the same dataset as in the previous sections. Start off by splitting the dataset into training and validation sets:
+```python
+import pandas as pd
+from sklearn.model_selection import train_test_split
+
+train_df, validation_df = train_test_split(pd.read_csv("data_qa.csv"), test_size=0.2, random_state=42)
+train_df.to_csv("data_train.csv", index=False)  # 80% of rows, used for fine-tuning
+validation_df.to_csv("data_val.csv", index=False)  # 20% of rows, held out for validation
+```
+
+You can use any tuning API to fine-tune embedding models. In this example, we'll use llama-index, as it also comes with utilities for synthetic data generation and model training.
+
+
+Next, parse each split into llama-index text nodes and generate synthetic QA pairs from each node.
+```python
+from pathlib import Path
+
+from llama_index.core.node_parser import SentenceSplitter
+from llama_index.readers.file import PagedCSVReader
+from llama_index.finetuning import generate_qa_embedding_pairs
+from llama_index.llms.openai import OpenAI
+
+def load_corpus(file):
+    loader = PagedCSVReader(encoding="utf-8")
+    docs = loader.load_data(file=Path(file))
+    parser = SentenceSplitter()
+    return parser.get_nodes_from_documents(docs)
+
+# Parse the train/validation CSVs written in the previous step into text nodes
+train_nodes = load_corpus("data_train.csv")
+val_nodes = load_corpus("data_val.csv")
+# Generate synthetic QA pairs with an LLM (requires OPENAI_API_KEY to be set)
+train_dataset = generate_qa_embedding_pairs(
+    llm=OpenAI(model="gpt-3.5-turbo"), nodes=train_nodes, verbose=False
+)
+val_dataset = generate_qa_embedding_pairs(
+    llm=OpenAI(model="gpt-3.5-turbo"), nodes=val_nodes, verbose=False
+)
+```
+
+Now we'll use the `SentenceTransformersFinetuneEngine` to fine-tune the model. You can also use the `sentence-transformers` or `transformers` libraries directly to fine-tune the model.
+
+```python
+from llama_index.finetuning import SentenceTransformersFinetuneEngine
+
+finetune_engine = SentenceTransformersFinetuneEngine(
+    train_dataset,
+    model_id="BAAI/bge-small-en-v1.5",
+    model_output_path="tuned_model",
+    val_dataset=val_dataset,
+)
+finetune_engine.finetune()
+embed_model = finetune_engine.get_finetuned_model()
+```
+This saves the fine-tuned embedding model in the `tuned_model` folder.
+
+## Evaluation results
+To evaluate the retriever, you can either use this model to ingest the data into LanceDB directly, or use llama-index's LanceDB integration to create a `VectorStoreIndex` and use it as a retriever.
+Running the same hit-rate evaluation as before shows an improvement for every query type that uses the embedding model: vector search (0.640 to 0.672), reranked vector search (0.677 to 0.754), and hybrid search (0.759 to 0.768). The full-text search results are unchanged.
+
+### Baseline
+| Query Type | Hit-rate@5 |
+| --- | --- |
+| Vector Search | 0.640 |
+| Full-text Search | 0.595 |
+| Reranked Vector Search | 0.677 |
+| Reranked Full-text Search | 0.672 |
+| Hybrid Search (w/ CohereReranker) | 0.759 |
+
+### Fine-tuned model (2 iterations)
+| Query Type | Hit-rate@5 |
+| --- | --- |
+| Vector Search | 0.672 |
+| Full-text Search | 0.595 |
+| Reranked Vector Search | 0.754 |
+| Reranked Full-text Search | 0.672 |
+| Hybrid Search (w/ CohereReranker) | 0.768 |
--- a/docs/src/notebooks/embedding_tuner.ipynb
+++ b/docs/src/notebooks/embedding_tuner.ipynb
--- a/docs/src/notebooks/lancedb_reranking.ipynb
+++ b/docs/src/notebooks/lancedb_reranking.ipynb
@@ -6,7 +6,7 @@
        "id": "b3Y3DOVqtIbc"
      },
      "source": [
-        "# Example walkthrough\n",
+        "# Example - Improve Retrievers using Rerankers & Hybrid search\n",
        "\n",
        "## Optimizing RAG retrieval performance using hybrid search & reranking"
      ]