lancedb/default_embedding_functions.md at 4d5d748acd3cab21d1ebf6b0cbe6f9ce963d31d4

mirror of https://github.com/lancedb/lancedb.git synced 2026-01-08 12:52:58 +00:00


Prashanth Rao 4d5d748acd docs: Updates and refactor (#683 )


2024-04-05 16:27:12 -07:00


There are various embedding functions available out of the box with LanceDB to manage your embeddings implicitly. We're actively working on adding other popular embedding APIs and models.

Text embedding functions

This section covers the text embedding functions that LanceDB registers by default.

Embedding functions come with a built-in rate-limit handler: calls that embed source data or queries are wrapped so that failed requests are retried with exponential backoff.
Every `EmbeddingFunction` implementation accepts a `max_retries` argument, which defaults to 7.
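
To make the retry behaviour concrete, here is a minimal sketch of retrying with exponential backoff. It is not LanceDB's actual wrapper: the helper name `retry_with_backoff` and the `initial_delay`, `factor`, `jitter`, and `sleep` parameters are assumptions made for illustration; only the idea of `max_retries` with a default of 7 comes from the text above.

```python
import random
import time


def retry_with_backoff(func, max_retries=7, initial_delay=1.0, factor=2.0,
                       jitter=True, sleep=time.sleep):
    """Call func(); on failure wait, then retry with a growing delay.

    Hypothetical helper for illustration, not part of the LanceDB API.
    """
    delay = initial_delay
    for attempt in range(max_retries + 1):
        try:
            return func()
        except Exception:
            if attempt == max_retries:
                raise  # retries exhausted: surface the last error
            # Optional random jitter spreads out retries from many clients.
            wait = delay * (1 + random.random()) if jitter else delay
            sleep(wait)
            delay *= factor  # exponential growth of the base delay


# A stand-in embedding call that fails twice before succeeding.
calls = {"n": 0}


def flaky_embed():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("rate limited")
    return [0.1, 0.2, 0.3]


waits = []  # record the delays instead of actually sleeping
result = retry_with_backoff(flaky_embed, jitter=False, sleep=waits.append)
print(result, waits)
```

With jitter disabled, the recorded waits double on each failure, which is the defining property of exponential backoff.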

Sentence transformers

You can set the following parameters when creating a `sentence-transformers` embedding function:

| Parameter | Type | Default Value | Description |
|---|---|---|---|
| `name` | `str` | `all-MiniLM-L6-v2` | The name of the model |
| `device` | `str` | `cpu` | The device to run the model on (can be `cpu` or `gpu`) |
| `normalize` | `bool` | `True` | Whether to normalize the input text before feeding it to the model |

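import lancedb
from lancedb.embeddings import EmbeddingFunctionRegistry
from lancedb.pydantic import LanceModel, Vector

db = lancedb.connect("/tmp/db")
registry = EmbeddingFunctionRegistry.get_instance()
func = registry.get("sentence-transformers").create(device="cpu")

class Words(LanceModel):
    text: str = func.SourceField()  # column that gets embedded
    vector: Vector(func.ndims()) = func.VectorField()  # column that stores the embedding

table = db.create_table("words", schema=Words)
table.add(
    [
        {"text": "hello world"},
        {"text": "goodbye world"},
    ]
)

# the query string is embedded with the same function before searching
query = "greetings"
actual = table.search(query).limit(1).to_pydantic(Words)[0]
print(actual.text)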

OpenAI embeddings

LanceDB registers the OpenAI embedding function in the registry by default, under the name `openai`. You can customize the following parameters when creating an instance:

| Parameter | Type | Default Value | Description |
|---|---|---|---|
| `name` | `str` | `"text-embedding-ada-002"` | The name of the model. |

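import lancedb
from lancedb.embeddings import EmbeddingFunctionRegistry
from lancedb.pydantic import LanceModel, Vector

db = lancedb.connect("/tmp/db")
registry = EmbeddingFunctionRegistry.get_instance()
func = registry.get("openai").create()

class Words(LanceModel):
    text: str = func.SourceField()  # column that gets embedded
    vector: Vector(func.ndims()) = func.VectorField()  # column that stores the embedding

table = db.create_table("words", schema=Words)
table.add(
    [
        {"text": "hello world"},
        {"text": "goodbye world"},
    ]
)

# the query string is embedded with the same function before searching
query = "greetings"
actual = table.search(query).limit(1).to_pydantic(Words)[0]
print(actual.text)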

Instructor Embeddings

Instructor is an instruction-finetuned text embedding model that can generate text embeddings tailored to any task (e.g. classification, retrieval, clustering, text evaluation, etc.) and domains (e.g. science, finance, etc.) by simply providing the task instruction, without any finetuning.

If you want to calculate customized embeddings for specific sentences, you can follow the unified template to write instructions.

!!! info
    Represent the `domain` `text_type` for `task_objective`:

    * `domain` is optional, and it specifies the domain of the text, e.g. science, finance, medicine, etc.
    * `text_type` is required, and it specifies the encoding unit, e.g. sentence, document, paragraph, etc.
    * `task_objective` is optional, and it specifies the objective of embedding, e.g. retrieve a document, classify the sentence, etc.
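
The template above can be assembled programmatically. The following is a small sketch; `build_instruction` is a hypothetical helper written for this example and is not part of LanceDB or the Instructor library.

```python
from typing import Optional


def build_instruction(text_type: str,
                      domain: Optional[str] = None,
                      task_objective: Optional[str] = None) -> str:
    """Fill in "Represent the domain text_type for task_objective:".

    Hypothetical helper: only text_type is required, as in the template.
    """
    if not text_type:
        raise ValueError("text_type is required")
    parts = ["Represent the"]
    if domain:
        parts.append(domain)  # optional domain, e.g. "science"
    parts.append(text_type)  # required encoding unit, e.g. "document"
    instruction = " ".join(parts)
    if task_objective:
        instruction += f" for {task_objective}"  # optional objective
    return instruction + ":"


print(build_instruction("document", domain="science", task_objective="retrieval"))
print(build_instruction("sentence"))
```

The resulting strings could then be passed as `source_instruction` or `query_instruction` when creating the embedding function.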

More information about the model can be found at the source URL.

| Argument | Type | Default | Description |
|---|---|---|---|
| `name` | `str` | `"hkunlp/instructor-base"` | The name of the model to use |
| `batch_size` | `int` | `32` | The batch size to use when generating embeddings |
| `device` | `str` | `"cpu"` | The device to use when generating embeddings |
| `show_progress_bar` | `bool` | `True` | Whether to show a progress bar when generating embeddings |
| `normalize_embeddings` | `bool` | `True` | Whether to normalize the embeddings |
| `quantize` | `bool` | `False` | Whether to quantize the model |
| `source_instruction` | `str` | `"represent the document for retrieval"` | The instruction for the source column |
| `query_instruction` | `str` | `"represent the document for retrieving the most similar documents"` | The instruction for the query |
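
As a rough illustration of what normalizing an embedding means, the sketch below scales a vector to unit length (L2 normalization). This assumes that `normalize_embeddings` refers to L2 normalization, as it does in sentence-transformers; the helper `l2_normalize` is written for this example and is not a LanceDB function.

```python
import math


def l2_normalize(vector):
    """Scale a vector so that its Euclidean (L2) length is 1."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)  # a zero vector cannot be scaled to unit length
    return [x / norm for x in vector]


unit = l2_normalize([3.0, 4.0])
length = math.sqrt(sum(x * x for x in unit))
print(unit, length)
```

Unit-length vectors are convenient because their dot product depends only on the angle between them, not on their magnitudes.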

import lancedb
from lancedb.pydantic import LanceModel, Vector
from lancedb.embeddings import get_registry

# create the embedding function with explicit source and query instructions
instructor = get_registry().get("instructor").create(
    source_instruction="represent the document for retrieval",
    query_instruction="represent the document for retrieving the most similar documents",
)

class Schema(LanceModel):
    vector: Vector(instructor.ndims()) = instructor.VectorField()
    text: str = instructor.SourceField()

db = lancedb.connect("~/.lancedb")
tbl = db.create_table("test", schema=Schema, mode="overwrite")

texts = [
    {"text": "Capitalism has been dominant in the Western world since the end of feudalism, but most feel that..."},
    {"text": "The disparate impact theory is especially controversial under the Fair Housing Act because the Act..."},
    {"text": "Disparate impact in United States labor law refers to practices in employment, housing, and other areas that.."},
]

# embeddings for the "text" column are generated automatically on insert
tbl.add(texts)

Gemini Embedding Function

With Google's Gemini, you can represent text (words, sentences, and blocks of text) in a vectorized form, making it easier to compare and contrast embeddings. For example, two texts that share a similar subject matter or sentiment should have similar embeddings, which can be identified through mathematical comparison techniques such as cosine similarity. For more on how and why you should use embeddings, refer to the Embeddings guide. The Gemini Embedding Model API supports various task types:

| Task Type | Description |
|---|---|
| `"retrieval_query"` | Specifies the given text is a query in a search/retrieval setting. |
| `"retrieval_document"` | Specifies the given text is a document in a search/retrieval setting. Using this task type requires a title, but it is automatically provided by the Embeddings API. |
| `"semantic_similarity"` | Specifies the given text will be used for Semantic Textual Similarity (STS). |
| `"classification"` | Specifies that the embeddings will be used for classification. |
| `"clustering"` | Specifies that the embeddings will be used for clustering. |

Usage Example:

import lancedb
import pandas as pd
from lancedb.pydantic import LanceModel, Vector
from lancedb.embeddings import get_registry


model = get_registry().get("gemini-text").create()

class TextModel(LanceModel):
    text: str = model.SourceField()
    vector: Vector(model.ndims()) = model.VectorField()

df = pd.DataFrame({"text": ["hello world", "goodbye world"]})
db = lancedb.connect("~/.lancedb")
tbl = db.create_table("test", schema=TextModel, mode="overwrite")

tbl.add(df)
rs = tbl.search("hello").limit(1).to_pandas()
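
The Gemini section mentions cosine similarity as a way to compare embeddings. The sketch below computes it for small hand-made vectors; the vectors are toy values chosen for illustration, not real Gemini embeddings, and `cosine_similarity` is a helper written for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional "embeddings" (made-up values for illustration only).
hello = [1.0, 0.0, 1.0]
greeting = [0.9, 0.1, 0.8]
unrelated = [0.0, 1.0, 0.0]

print(cosine_similarity(hello, greeting))   # close to 1: similar direction
print(cosine_similarity(hello, unrelated))  # 0: orthogonal vectors
```

Values near 1 indicate vectors pointing in nearly the same direction, which is how texts with similar subject matter are expected to appear.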

Multi-modal embedding functions

Multi-modal embedding functions allow you to query your table using both images and text.

OpenClip embeddings

We support CLIP model embeddings through the open-source alternative, open-clip. It is registered as `open-clip` and supports the following customizations:

| Parameter | Type | Default Value | Description |
|---|---|---|---|
| `name` | `str` | `"ViT-B-32"` | The name of the model. |
| `pretrained` | `str` | `"laion2b_s34b_b79k"` | The name of the pretrained model to load. |
| `device` | `str` | `"cpu"` | The device to run the model on. Can be `"cpu"` or `"gpu"`. |
| `batch_size` | `int` | `64` | The number of images to process in a batch. |
| `normalize` | `bool` | `True` | Whether to normalize the input images before feeding them to the model. |

This embedding function supports ingesting images as both bytes and URLs. You can query them using both text and other images.

!!! info
    LanceDB supports ingesting images directly from accessible links.


import lancedb
import requests
from lancedb.embeddings import EmbeddingFunctionRegistry
from lancedb.pydantic import LanceModel, Vector

# "/tmp/db" stands in for the undefined `tmp_path`; use any writable directory
db = lancedb.connect("/tmp/db")
registry = EmbeddingFunctionRegistry.get_instance()
func = registry.get("open-clip").create()

class Images(LanceModel):
    label: str
    image_uri: str = func.SourceField()  # image uri as the source
    image_bytes: bytes = func.SourceField()  # image bytes as the source
    vector: Vector(func.ndims()) = func.VectorField()  # vector column
    vec_from_bytes: Vector(func.ndims()) = func.VectorField()  # another vector column

table = db.create_table("images", schema=Images)
labels = ["cat", "cat", "dog", "dog", "horse", "horse"]
uris = [
    "http://farm1.staticflickr.com/53/167798175_7c7845bbbd_z.jpg",
    "http://farm1.staticflickr.com/134/332220238_da527d8140_z.jpg",
    "http://farm9.staticflickr.com/8387/8602747737_2e5c2a45d4_z.jpg",
    "http://farm5.staticflickr.com/4092/5017326486_1f46057f5f_z.jpg",
    "http://farm9.staticflickr.com/8216/8434969557_d37882c42d_z.jpg",
    "http://farm6.staticflickr.com/5142/5835678453_4f3a4edb45_z.jpg",
]
# get each uri as bytes
image_bytes = [requests.get(uri).content for uri in uris]
# add one row per image rather than a single row holding whole lists
table.add(
    [
        {"label": label, "image_uri": uri, "image_bytes": data}
        for label, uri, data in zip(labels, uris, image_bytes)
    ]
)

Now we can search using text against both the default vector column and the custom vector column:


# text search
actual = table.search("man's best friend").limit(1).to_pydantic(Images)[0]
print(actual.label) # prints "dog"

frombytes = (
    table.search("man's best friend", vector_column_name="vec_from_bytes")
    .limit(1)
    .to_pydantic(Images)[0]
)
print(frombytes.label)

Because we're using a multi-modal embedding function, we can also search using images:

import io

import requests
from PIL import Image

# image search
query_image_uri = "http://farm1.staticflickr.com/200/467715466_ed4a31801f_z.jpg"
image_bytes = requests.get(query_image_uri).content
query_image = Image.open(io.BytesIO(image_bytes))
actual = table.search(query_image).limit(1).to_pydantic(Images)[0]
print(actual.label == "dog")

# image search using a custom vector column
other = (
    table.search(query_image, vector_column_name="vec_from_bytes")
    .limit(1)
    .to_pydantic(Images)[0]
)
print(other.label)

If you have any questions about the embeddings API, supported models, or see a relevant model missing, please raise an issue on GitHub.
