# LangChain

![Illustration](../assets/langchain.png)

## Quick Start

You can load your document data using LangChain's loaders; in this example we use `TextLoader` with `OpenAIEmbeddings` as the embedding model. Check out the complete example here - [LangChain demo](../notebooks/langchain_example.ipynb)

```python
import os

from langchain.document_loaders import TextLoader
from langchain.vectorstores import LanceDB
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter

os.environ["OPENAI_API_KEY"] = "sk-..."

loader = TextLoader("../../modules/state_of_the_union.txt")  # Replace with your data path
documents = loader.load()
documents = CharacterTextSplitter().split_documents(documents)

embeddings = OpenAIEmbeddings()
docsearch = LanceDB.from_documents(documents, embeddings)

query = "What did the president say about Ketanji Brown Jackson"
docs = docsearch.similarity_search(query)
print(docs[0].page_content)
```

## Documentation

In the example above, the `LanceDB` vector store object is created with the `from_documents()` method, which is a `classmethod` that returns an initialized instance. You can also use the `LanceDB.from_texts(texts: List[str], embedding: Embeddings)` class method.

The full list of parameters for the `LanceDB` vector store is:

- `connection`: (Optional) `lancedb.db.LanceDBConnection` connection object to use. If not provided, a new connection will be created.
- `embedding`: LangChain embedding model.
- `vector_key`: (Optional) Column name to use for the vectors in the table. Defaults to `'vector'`.
- `id_key`: (Optional) Column name to use for the ids in the table. Defaults to `'id'`.
- `text_key`: (Optional) Column name to use for the text in the table. Defaults to `'text'`.
- `table_name`: (Optional) Name of your table in the database. Defaults to `'vectorstore'`.
- `api_key`: (Optional) API key to use for the LanceDB Cloud database. Defaults to `None`.
- `region`: (Optional) Region to use for the LanceDB Cloud database. Only for LanceDB Cloud; defaults to `None`.
- `mode`: (Optional) Mode to use for adding data to the table. Defaults to `'overwrite'`.
- `reranker`: (Optional) The reranker to use for LanceDB.
- `relevance_score_fn`: (Optional[Callable[[float], float]]) LangChain relevance score function to be used. Defaults to `None`.

```python
db_url = "db://lang_test"  # url of db you created
api_key = "xxxxx"          # your API key
region = "us-east-1-dev"   # your selected region

vector_store = LanceDB(
    uri=db_url,
    api_key=api_key,  # don't include for local API
    region=region,    # don't include for local API
    embedding=embeddings,
    table_name='langchain_test'  # Optional
)
```
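For a local (non-Cloud) setup you can instead pass an existing `lancedb` connection via the `connection` parameter. This is a minimal sketch, assuming `lancedb` is installed; the database path and table name are illustrative:

```python
import lancedb

# Connect to (or create) a local LanceDB database in a directory of your choice.
db = lancedb.connect("/tmp/lancedb")  # illustrative path

vector_store = LanceDB(
    connection=db,                # reuse an existing connection instead of a URI
    embedding=embeddings,
    table_name="langchain_test",  # Optional; defaults to 'vectorstore'
    mode="overwrite",             # Optional; how new data is written to the table
)
```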
### Methods

##### add_texts()

- `texts`: `Iterable` of strings to add to the vectorstore.
- `metadatas`: Optional `list[dict()]` of metadatas associated with the texts.
- `ids`: Optional `list` of ids to associate with the texts.
- `kwargs`: `Any`

This method adds texts and stores the respective embeddings automatically.

```python
vector_store.add_texts(texts=['test_123'], metadatas=[{'source': 'wiki'}])

# Additionally, to explore the table you can load it into a DataFrame or save it to a CSV file:
tbl = vector_store.get_table()
print("tbl:", tbl)
pd_df = tbl.to_pandas()
pd_df.to_csv("docsearch.csv", index=False)

# you can also create a new vector store object using an older connection object:
vector_store = LanceDB(connection=tbl, embedding=embeddings)
```

##### create_index()

- `col_name`: `Optional[str] = None`
- `vector_col`: `Optional[str] = None`
- `num_partitions`: `Optional[int] = 256`
- `num_sub_vectors`: `Optional[int] = 96`
- `index_cache_size`: `Optional[int] = None`

This method creates an index for the vector store. Make sure your table has enough data in it before creating an index. An ANN index is usually not needed for datasets of around ~100K vectors or fewer. For large-scale (>1M) or higher-dimensional vectors, it is beneficial to create an ANN index.

```python
# for creating a vector index
vector_store.create_index(vector_col='vector', metric='cosine')

# for creating a scalar index (for non-vector columns)
vector_store.create_index(col_name='text')
```

##### similarity_search()

- `query`: `str`
- `k`: `Optional[int] = None`
- `filter`: `Optional[Dict[str, str]] = None`
- `fts`: `Optional[bool] = False`
- `name`: `Optional[str] = None`
- `kwargs`: `Any`

Returns the documents most similar to the query, without relevance scores.

```python
docs = docsearch.similarity_search(query)
print(docs[0].page_content)
```

##### similarity_search_by_vector()

- `embedding`: `List[float]`
- `k`: `Optional[int] = None`
- `filter`: `Optional[Dict[str, str]] = None`
- `name`: `Optional[str] = None`
- `kwargs`: `Any`

Returns the documents most similar to the given query vector.

```python
docs = docsearch.similarity_search_by_vector(embeddings.embed_query(query))
print(docs[0].page_content)
```

##### similarity_search_with_score()

- `query`: `str`
- `k`: `Optional[int] = None`
- `filter`: `Optional[Dict[str, str]] = None`
- `kwargs`: `Any`

Returns the documents most similar to the query string along with relevance scores. It is called by the base class's `similarity_search_with_relevance_scores` method, which selects the relevance score based on our `_select_relevance_score_fn`.

```python
docs = docsearch.similarity_search_with_relevance_scores(query)
print("relevance score - ", docs[0][1])
print("text- ", docs[0][0].page_content[:1000])
```

##### similarity_search_by_vector_with_relevance_scores()

- `embedding`: `List[float]`
- `k`: `Optional[int] = None`
- `filter`: `Optional[Dict[str, str]] = None`
- `name`: `Optional[str] = None`
- `kwargs`: `Any`

Returns the documents most similar to the query vector along with relevance scores.

```python
query_embedding = embeddings.embed_query(query)
docs = docsearch.similarity_search_by_vector_with_relevance_scores(query_embedding)
print("relevance score - ", docs[0][1])
print("text- ", docs[0][0].page_content[:1000])
```

##### max_marginal_relevance_search()

- `query`: `str`
- `k`: `Optional[int] = None`
- `fetch_k`: Number of documents to fetch to pass to the MMR algorithm. `Optional[int] = None`
- `lambda_mult`: Number between 0 and 1 that determines the degree of diversity among the results, with 0 corresponding to maximum diversity and 1 to minimum diversity. Defaults to 0.5. `float = 0.5`
- `filter`: `Optional[Dict[str, str]] = None`
- `kwargs`: `Any`

Returns the documents selected using maximal marginal relevance (MMR). Maximal marginal relevance optimizes for similarity to the query AND diversity among the selected documents.
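The relevance/diversity trade-off can be tuned through `fetch_k` and `lambda_mult`. A minimal sketch, with illustrative parameter values:

```python
# Fetch 25 candidates, then select the 5 results that best balance relevance
# and diversity (a lambda_mult closer to 0 favours diversity).
docs = docsearch.max_marginal_relevance_search(
    query,
    k=5,
    fetch_k=25,
    lambda_mult=0.25,
)
print([doc.page_content[:100] for doc in docs])
```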
Similarly, the `max_marginal_relevance_search_by_vector()` function returns the docs most similar to the given embedding using MMR. Instead of a string query, you pass the embedding to be searched for.

```python
result = docsearch.max_marginal_relevance_search(
    query="text"
)
result_texts = [doc.page_content for doc in result]
print(result_texts)

# search by vector:
result = docsearch.max_marginal_relevance_search_by_vector(
    embeddings.embed_query("text")
)
result_texts = [doc.page_content for doc in result]
print(result_texts)
```

##### add_images()

- `uris`: File paths to the images. `List[str]`
- `metadatas`: Optional list of metadatas. `(Optional[List[dict]], optional)`
- `ids`: Optional list of IDs. `(Optional[List[str]], optional)`

Adds images by automatically creating their embeddings and adding them to the vectorstore.

```python
vec_store.add_images(uris=image_uris)
# here image_uris are local fs paths to the images.
```
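The optional `metadatas` and `ids` arguments follow the same convention as `add_texts()`: one entry per image URI. A minimal sketch, assuming the embedding model supports images (e.g., an OpenCLIP embedding) and that `image_uris` holds two local paths; the metadata values and ids are illustrative:

```python
vec_store.add_images(
    uris=image_uris,                                  # e.g. ["./imgs/1.png", "./imgs/2.png"]
    metadatas=[{"label": "cat"}, {"label": "dog"}],   # one dict per image
    ids=["img-1", "img-2"],                           # one id per image
)
```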