diff --git a/docs/src/notebooks/reproducibility.ipynb b/docs/src/notebooks/reproducibility.ipynb new file mode 100644 index 00000000..e72aa1cc --- /dev/null +++ b/docs/src/notebooks/reproducibility.ipynb @@ -0,0 +1,1167 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "c0de1e6a-61f7-4f99-a2fd-1461902ab36a", + "metadata": {}, + "source": [ + "# Reproducible AI with LanceDB\n", + "\n", + "Reproducibility is critical for AI. For code, it's easy to keep track of changes using GitHub or GitLab.\n", + "For data, it's not as easy. Most of the time, we're manually writing complicated data tracking code, wrestling with an external tool, and dealing with expensive duplicate snapshot copies with low granularity.\n", + "\n", + "For vector databases, if we make a mistake, we have to blow away the index, correct the mistake, and then completely rebuild it. It's difficult to roll back mistakes, and rebuilding destroys any historical paper trail we could use to debug and diagnose errors.\n", + "\n", + "LanceDB is the first and only vector database that supports full reproducibility natively.\n", + "Taking advantage of the Lance columnar format, LanceDB supports:\n", + "- automatic versioning\n", + "- instant rollback\n", + "- appends, updates, deletions\n", + "- schema evolution\n", + "\n", + "Together, these make auditing, tracking, and reproducibility a breeze.\n", + "\n", + "Let's see how this all works." + ] + }, + { + "cell_type": "markdown", + "id": "cafebbce-d324-485d-90ec-503695875f47", + "metadata": {}, + "source": [ + "## Pickle Rick!" 
+ ] + }, + { + "cell_type": "markdown", + "id": "0e74818f-109e-4b09-b5f8-dd1875c512e3", + "metadata": {}, + "source": [ + "We'll start with a local LanceDB connection" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "1f57d988-56b9-4384-8a7b-000d5f91034a", + "metadata": {}, + "outputs": [], + "source": [ + "import lancedb\n", + "db = lancedb.connect(\"~/.lancedb\")" + ] + }, + { + "cell_type": "markdown", + "id": "9c4c443d-2f14-455d-b766-bacbaad43d20", + "metadata": {}, + "source": [ + "We've got a CSV file with a bunch of quotes from Rick and Morty" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "08556aeb-6bdc-451c-99f5-163374fdec55", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "id,quote,author\n", + "1,\"Nobody exists on purpose. Nobody belongs anywhere.\",Morty\n", + "2,\"We're all going to die. Come watch TV.\",Morty\n", + "3,\"Losers look stuff up while the rest of us are carpin' all them diems.\",Summer\n", + "4,\"He's not a hot girl. He can't just bail on his life and set up shop in someone else's.\",Beth\n", + "5,\"When you are an a—hole, it doesn't matter how right you are. Nobody wants to give you the satisfaction.\",Morty\n", + "6,\"God's turning people into insect monsters, Beth. I'm the one beating them to death. Thank me.\",Jerry\n", + "7,\"Camping is just being homeless without the change.\",Summer\n", + "8,\"This seems like a good time for a drink and a cold, calculated speech with sinister overtones. A speech about politics, about order, brotherhood, power ... but speeches are for campaigning. Now is the time for action.\",Morty\n", + "9,\"Having a family doesn't mean that you stop being an individual. You know the best thing you can do for the people that depend on you? Be honest with them, even if it means setting them free.\",Mr. 
Meeseeks\n" + ] + } + ], + "source": [ + "!head rick_and_morty_quotes.csv" + ] + }, + { + "cell_type": "markdown", + "id": "a5fcdcda-b0fe-4ac4-90b4-6b42cf2ef34d", + "metadata": {}, + "source": [ + "Let's load this into a pandas dataframe.\n", + "\n", + "It's got 3 columns, a quote id, the quote string, and the first name of the author of the quote:" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "def3ae59-77d9-43f0-ba6d-415a1503856b", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
idquoteauthor
01Nobody exists on purpose. Nobody belongs anywh...Morty
12We're all going to die. Come watch TV.Morty
23Losers look stuff up while the rest of us are ...Summer
34He's not a hot girl. He can't just bail on his...Beth
45When you are an a—hole, it doesn't matter how ...Morty
\n", + "
" + ], + "text/plain": [ + " id quote author\n", + "0 1 Nobody exists on purpose. Nobody belongs anywh... Morty\n", + "1 2 We're all going to die. Come watch TV. Morty\n", + "2 3 Losers look stuff up while the rest of us are ... Summer\n", + "3 4 He's not a hot girl. He can't just bail on his... Beth\n", + "4 5 When you are an a—hole, it doesn't matter how ... Morty" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "import pandas as pd\n", + "df = pd.read_csv(\"rick_and_morty_quotes.csv\")\n", + "df.head()" + ] + }, + { + "cell_type": "markdown", + "id": "4ba9ffac-c779-49e3-91a7-f1c00f3fda41", + "metadata": {}, + "source": [ + "Creating a LanceDB table from a pandas dataframe is straightforward using `create_table`" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "bd981f6d-b921-4b1d-b63a-6c1d59f3a51d", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
idquoteauthor
01Nobody exists on purpose. Nobody belongs anywh...Morty
12We're all going to die. Come watch TV.Morty
23Losers look stuff up while the rest of us are ...Summer
34He's not a hot girl. He can't just bail on his...Beth
45When you are an a—hole, it doesn't matter how ...Morty
\n", + "
" + ], + "text/plain": [ + " id quote author\n", + "0 1 Nobody exists on purpose. Nobody belongs anywh... Morty\n", + "1 2 We're all going to die. Come watch TV. Morty\n", + "2 3 Losers look stuff up while the rest of us are ... Summer\n", + "3 4 He's not a hot girl. He can't just bail on his... Beth\n", + "4 5 When you are an a—hole, it doesn't matter how ... Morty" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "db.drop_table(\"rick_and_morty\", ignore_missing=True)\n", + "table = db.create_table(\"rick_and_morty\", df)\n", + "table.head().to_pandas()" + ] + }, + { + "cell_type": "markdown", + "id": "38d055be-ae3e-4190-b1cf-abf14cdf8975", + "metadata": {}, + "source": [ + "## Updates" + ] + }, + { + "cell_type": "markdown", + "id": "842550fb-da81-44ea-9e98-d5dbaa6916c7", + "metadata": {}, + "source": [ + "Now, since Rick is the smartest man in the multiverse, he deserves to have his quotes attributed to his full name: Richard Daniel Sanchez.\n", + "\n", + "This can be done via `LanceTable.update`. It needs two arguments:\n", + "\n", + "1. A `where` string filter (sql syntax) to determine the rows to update\n", + "2. A dict of `values` where the keys are the column names to update and the values are the new values" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "9eac4708-a8c4-49aa-bc13-8e60c5bf34a0", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
idquoteauthor
01Nobody exists on purpose. Nobody belongs anywh...Morty
12We're all going to die. Come watch TV.Morty
23Losers look stuff up while the rest of us are ...Summer
34He's not a hot girl. He can't just bail on his...Beth
45When you are an a—hole, it doesn't matter how ...Morty
............
5657If I let you make me nervous, then we can't ge...Richard Daniel Sanchez
5758Oh, boy, so you actually learned something tod...Richard Daniel Sanchez
5859I can't abide bureaucracy. I don't like being ...Richard Daniel Sanchez
5960I think you have to think ahead and live in th...Richard Daniel Sanchez
6061I know that new situations can be intimidating...Richard Daniel Sanchez
\n", + "

61 rows × 3 columns

\n", + "
" + ], + "text/plain": [ + " id quote \\\n", + "0 1 Nobody exists on purpose. Nobody belongs anywh... \n", + "1 2 We're all going to die. Come watch TV. \n", + "2 3 Losers look stuff up while the rest of us are ... \n", + "3 4 He's not a hot girl. He can't just bail on his... \n", + "4 5 When you are an a—hole, it doesn't matter how ... \n", + ".. .. ... \n", + "56 57 If I let you make me nervous, then we can't ge... \n", + "57 58 Oh, boy, so you actually learned something tod... \n", + "58 59 I can't abide bureaucracy. I don't like being ... \n", + "59 60 I think you have to think ahead and live in th... \n", + "60 61 I know that new situations can be intimidating... \n", + "\n", + " author \n", + "0 Morty \n", + "1 Morty \n", + "2 Summer \n", + "3 Beth \n", + "4 Morty \n", + ".. ... \n", + "56 Richard Daniel Sanchez \n", + "57 Richard Daniel Sanchez \n", + "58 Richard Daniel Sanchez \n", + "59 Richard Daniel Sanchez \n", + "60 Richard Daniel Sanchez \n", + "\n", + "[61 rows x 3 columns]" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "table.update(where=\"author='Rick'\", values={\"author\": \"Richard Daniel Sanchez\"})\n", + "table.to_pandas()" + ] + }, + { + "cell_type": "markdown", + "id": "ac6499ce-af6d-4934-9051-be5f159ce623", + "metadata": {}, + "source": [ + "## Schema evolution" + ] + }, + { + "cell_type": "markdown", + "id": "0402226b-6d0c-41c5-9257-069c4bf16825", + "metadata": {}, + "source": [ + "This is a vector database, so we need actual vectors.\n", + "We'll use sentence transformers here to avoid having to deal with API keys and all that." 
+ ] + }, + { + "cell_type": "markdown", + "id": "85db4ed9-8f80-4b56-9867-1381fa1c4c7d", + "metadata": {}, + "source": [ + "Let's create a basic model using the \"all-MiniLM-L6-v2\" model and embed the quotes" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "998f4eb5-31cd-49ae-9f7c-2ec4d6652ef6", + "metadata": {}, + "outputs": [], + "source": [ + "from sentence_transformers import SentenceTransformer\n", + "model = SentenceTransformer(\"all-MiniLM-L6-v2\", device=\"cpu\")\n", + "vectors = model.encode(df.quote.values.tolist(),\n", + " convert_to_numpy=True,\n", + " normalize_embeddings=True).tolist()" + ] + }, + { + "cell_type": "markdown", + "id": "539e2a0e-529b-439b-ba8c-a388907c4860", + "metadata": {}, + "source": [ + "We can then convert the vectors into a pyarrow Table and merge it to the LanceDB Table.\n", + "\n", + "For the merge to work successfully, we need to have an overlapping column. Here the natural choice is to use the id column" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "ccbea593-85cf-484c-989f-9836a31c7906", + "metadata": {}, + "outputs": [], + "source": [ + "from lance.vector import vec_to_table\n", + "import numpy as np\n", + "import pyarrow as pa" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "727c8230-7e41-436a-8666-60ee46e7041b", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
vectorid
0[0.044295214, -0.08318844, -0.03597768, -0.039...1
1[0.05740536, -0.09669638, 0.005153852, -0.0213...2
2[0.05789702, -0.033441003, 0.013766681, -0.015...3
3[0.038649272, 0.012864259, -0.032611616, 0.019...4
4[0.076334454, 0.034511875, -0.0037649572, 0.02...5
\n", + "
" + ], + "text/plain": [ + " vector id\n", + "0 [0.044295214, -0.08318844, -0.03597768, -0.039... 1\n", + "1 [0.05740536, -0.09669638, 0.005153852, -0.0213... 2\n", + "2 [0.05789702, -0.033441003, 0.013766681, -0.015... 3\n", + "3 [0.038649272, 0.012864259, -0.032611616, 0.019... 4\n", + "4 [0.076334454, 0.034511875, -0.0037649572, 0.02... 5" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "embeddings = vec_to_table(vectors)\n", + "embeddings = embeddings.append_column(\"id\", pa.array(np.arange(len(table))+1))\n", + "embeddings.to_pandas().head()" + ] + }, + { + "cell_type": "markdown", + "id": "518da48d-6481-4c1e-8ba4-800d5e0542cf", + "metadata": {}, + "source": [ + "And now we'll use the `LanceTable.merge` function to add the vector column into the LanceTable." + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "a4326a70-9863-47e8-8f3f-565e35d558cf", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
idquoteauthorvector
01Nobody exists on purpose. Nobody belongs anywh...Morty[0.044295214, -0.08318844, -0.03597768, -0.039...
12We're all going to die. Come watch TV.Morty[0.05740536, -0.09669638, 0.005153852, -0.0213...
23Losers look stuff up while the rest of us are ...Summer[0.05789702, -0.033441003, 0.013766681, -0.015...
34He's not a hot girl. He can't just bail on his...Beth[0.038649272, 0.012864259, -0.032611616, 0.019...
45When you are an a—hole, it doesn't matter how ...Morty[0.076334454, 0.034511875, -0.0037649572, 0.02...
\n", + "
" + ], + "text/plain": [ + " id quote author \\\n", + "0 1 Nobody exists on purpose. Nobody belongs anywh... Morty \n", + "1 2 We're all going to die. Come watch TV. Morty \n", + "2 3 Losers look stuff up while the rest of us are ... Summer \n", + "3 4 He's not a hot girl. He can't just bail on his... Beth \n", + "4 5 When you are an a—hole, it doesn't matter how ... Morty \n", + "\n", + " vector \n", + "0 [0.044295214, -0.08318844, -0.03597768, -0.039... \n", + "1 [0.05740536, -0.09669638, 0.005153852, -0.0213... \n", + "2 [0.05789702, -0.033441003, 0.013766681, -0.015... \n", + "3 [0.038649272, 0.012864259, -0.032611616, 0.019... \n", + "4 [0.076334454, 0.034511875, -0.0037649572, 0.02... " + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "table.merge(embeddings, left_on=\"id\")\n", + "table.head().to_pandas()" + ] + }, + { + "cell_type": "markdown", + "id": "f590fec8-0ed0-4148-b940-c81abe7b421c", + "metadata": {}, + "source": [ + "If we look at the schema, we see that `all-MiniLM-L6-v2` produces 384-dimensional vectors" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "ca9596a0-b4a0-4a5e-8d9e-967cd13b1eae", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "id: int64\n", + "quote: string\n", + "author: string\n", + "vector: fixed_size_list[384]\n", + " child 0, item: float" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "table.schema" + ] + }, + { + "cell_type": "markdown", + "id": "f046002c-872c-4c39-ab85-e03c3b45b477", + "metadata": {}, + "source": [ + "## Rollback\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "dbfc298c-ada2-411b-925f-e53dc9d35f3c", + "metadata": {}, + "source": [ + "Suppose we used the table and found that the `all-MiniLM-L6-v2` model doesn't produce ideal results. Instead we want to try a larger model. 
How do we use the new embeddings without losing the change history?" + ] + }, + { + "cell_type": "markdown", + "id": "dfb116e4-b3b2-4b7e-bbf8-d3e63ca2aa14", + "metadata": {}, + "source": [ + "First, major operations are automatically versioned in LanceDB.\n", + "Version 1 is the table creation. It contains no rows and just records the schema and metadata.\n", + "Version 2 is the initial insertion of data.\n", + "Versions 3 and 4 represent the update (deletion + append).\n", + "Version 5 is adding the new column." + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "a411902b-43d0-4889-8e34-bc5f3c409726", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[{'version': 1,\n", + " 'timestamp': datetime.datetime(2023, 9, 6, 1, 54, 44, 171997),\n", + " 'metadata': {}},\n", + " {'version': 2,\n", + " 'timestamp': datetime.datetime(2023, 9, 6, 1, 54, 44, 190897),\n", + " 'metadata': {}},\n", + " {'version': 3,\n", + " 'timestamp': datetime.datetime(2023, 9, 6, 1, 54, 45, 449369),\n", + " 'metadata': {}},\n", + " {'version': 4,\n", + " 'timestamp': datetime.datetime(2023, 9, 6, 1, 54, 45, 462049),\n", + " 'metadata': {}},\n", + " {'version': 5,\n", + " 'timestamp': datetime.datetime(2023, 9, 6, 1, 54, 53, 793029),\n", + " 'metadata': {}}]" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "table.list_versions()" + ] + }, + { + "cell_type": "markdown", + "id": "7bd5e954-ac0f-4973-81c6-ad6120412d40", + "metadata": {}, + "source": [ + "We can restore version 4, the state before we added the vector column" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "ad0682cc-7599-459c-bbd8-1cd1f296c845", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
idquoteauthor
01Nobody exists on purpose. Nobody belongs anywh...Morty
12We're all going to die. Come watch TV.Morty
23Losers look stuff up while the rest of us are ...Summer
34He's not a hot girl. He can't just bail on his...Beth
45When you are an a—hole, it doesn't matter how ...Morty
\n", + "
" + ], + "text/plain": [ + " id quote author\n", + "0 1 Nobody exists on purpose. Nobody belongs anywh... Morty\n", + "1 2 We're all going to die. Come watch TV. Morty\n", + "2 3 Losers look stuff up while the rest of us are ... Summer\n", + "3 4 He's not a hot girl. He can't just bail on his... Beth\n", + "4 5 When you are an a—hole, it doesn't matter how ... Morty" + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "table.restore(4)\n", + "table.head().to_pandas()" + ] + }, + { + "cell_type": "markdown", + "id": "b0a51146-40d0-4f16-9555-5ce68c2c9eee", + "metadata": {}, + "source": [ + "Notice that we now have one more version, not one fewer. When we restore an old version, we're not deleting the version history; we're creating a new version whose schema and data are equivalent to the restored old version. In this way, we can keep track of all of the changes and always roll back to a previous state." + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "d5bfb448-20b9-45e9-90ba-8a73abb86668", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[{'version': 1,\n", + " 'timestamp': datetime.datetime(2023, 9, 6, 1, 54, 44, 171997),\n", + " 'metadata': {}},\n", + " {'version': 2,\n", + " 'timestamp': datetime.datetime(2023, 9, 6, 1, 54, 44, 190897),\n", + " 'metadata': {}},\n", + " {'version': 3,\n", + " 'timestamp': datetime.datetime(2023, 9, 6, 1, 54, 45, 449369),\n", + " 'metadata': {}},\n", + " {'version': 4,\n", + " 'timestamp': datetime.datetime(2023, 9, 6, 1, 54, 45, 462049),\n", + " 'metadata': {}},\n", + " {'version': 5,\n", + " 'timestamp': datetime.datetime(2023, 9, 6, 1, 54, 53, 793029),\n", + " 'metadata': {}},\n", + " {'version': 6,\n", + " 'timestamp': datetime.datetime(2023, 9, 6, 1, 55, 4, 264152),\n", + " 'metadata': {}}]" + ] + }, + "execution_count": 13, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "table.list_versions()" 
+ ] + }, + { + "cell_type": "markdown", + "id": "6713cb53-8cb9-4235-9c55-337c311f0af6", + "metadata": {}, + "source": [ + "### Switching Models\n", + "\n", + "Now we'll switch to the `all-mpnet-base-v2` model and add the vectors to the restored dataset again" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "id": "1fa2950d-3002-4903-b6c3-2760ce60d079", + "metadata": {}, + "outputs": [], + "source": [ + "model = SentenceTransformer(\"all-mpnet-base-v2\", device=\"cpu\")\n", + "vectors = model.encode(df.quote.values.tolist(),\n", + " convert_to_numpy=True,\n", + " normalize_embeddings=True).tolist()\n", + "embeddings = vec_to_table(vectors)\n", + "embeddings = embeddings.append_column(\"id\", pa.array(np.arange(len(table))+1))\n", + "table.merge(embeddings, left_on=\"id\")" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "id": "694c46e0-a1c3-4869-a1eb-562f14606ad4", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "id: int64\n", + "quote: string\n", + "author: string\n", + "vector: fixed_size_list[768]\n", + " child 0, item: float" + ] + }, + "execution_count": 15, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "table.schema" + ] + }, + { + "cell_type": "markdown", + "id": "5e4085a5-a2e7-4520-acfc-eabaae2caa7d", + "metadata": {}, + "source": [ + "## Deletion\n", + "\n", + "What if the whole show was just Rick-isms? 
\n", + "Let's delete any quote not said by Rick" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "id": "9d11ddf1-b352-496c-91d7-99c70cbf304b", + "metadata": {}, + "outputs": [], + "source": [ + "table.delete(\"author != 'Richard Daniel Sanchez'\")" + ] + }, + { + "cell_type": "markdown", + "id": "77d2f591-e492-423e-b995-2a18ae8cb831", + "metadata": {}, + "source": [ + "We can see that the number of rows has been reduced to 30" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "id": "20bcce48-a5df-43c7-9ab9-7d59a83055e9", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "30" + ] + }, + "execution_count": 17, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "len(table)" + ] + }, + { + "cell_type": "markdown", + "id": "ef8457b2-1228-4a25-824e-477a07681b48", + "metadata": {}, + "source": [ + "Ok we had our fun, let's get back to the full quote set" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "id": "6e279635-75b0-400c-8b43-4aa069282ccd", + "metadata": {}, + "outputs": [], + "source": [ + "table.restore(7)" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "id": "6a65b627-57a2-43b2-8acc-3805591845ad", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "61" + ] + }, + "execution_count": 25, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "len(table)" + ] + }, + { + "cell_type": "markdown", + "id": "ae1a6ee8-8868-49de-82ab-17a0f61f3a47", + "metadata": {}, + "source": [ + "## History\n", + "\n", + "We now have 9 versions in the data. 
We can review the operations that correspond to each version below:" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "id": "f595c9b8-91ec-48c1-9790-c40e1bd24b60", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "9" + ] + }, + "execution_count": 23, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "table.version" + ] + }, + { + "cell_type": "markdown", + "id": "774f4eb0-03d4-4fda-a825-6217bf096619", + "metadata": {}, + "source": [ + "\n", + "Versions:\n", + "- 1 - Create\n", + "- 2 - Append\n", + "- 3 - Update (deletion)\n", + "- 4 - Update (append)\n", + "- 5 - Merge (vector column)\n", + "- 6 - Restore (4)\n", + "- 7 - Merge (new vector column)\n", + "- 8 - Deletion\n", + "- 9 - Restore (7)" + ] + }, + { + "cell_type": "markdown", + "id": "fb0131e6-2b73-442a-b4c6-6976a9cf4c7e", + "metadata": {}, + "source": [ + "## Summary" + ] + }, + { + "cell_type": "markdown", + "id": "97a1cf79-b46b-40cd-ada0-54edef358627", + "metadata": {}, + "source": [ + "We never had to manage versioning explicitly, and we never had to create expensive, slow snapshots. LanceDB automatically tracks the full history of the operations we performed and supports fast rollbacks. In production, this is critical for debugging issues and minimizing downtime by rolling back to a previously successful state in seconds." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "28ae7a98-e9ce-41e7-a789-f396560b540a", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.3" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/docs/src/notebooks/rick_and_morty_quotes.csv b/docs/src/notebooks/rick_and_morty_quotes.csv new file mode 100644 index 00000000..42d0c63d --- /dev/null +++ b/docs/src/notebooks/rick_and_morty_quotes.csv @@ -0,0 +1,62 @@ +id,quote,author +1,"Nobody exists on purpose. Nobody belongs anywhere.",Morty +2,"We're all going to die. Come watch TV.",Morty +3,"Losers look stuff up while the rest of us are carpin' all them diems.",Summer +4,"He's not a hot girl. He can't just bail on his life and set up shop in someone else's.",Beth +5,"When you are an a—hole, it doesn't matter how right you are. Nobody wants to give you the satisfaction.",Morty +6,"God's turning people into insect monsters, Beth. I'm the one beating them to death. Thank me.",Jerry +7,"Camping is just being homeless without the change.",Summer +8,"This seems like a good time for a drink and a cold, calculated speech with sinister overtones. A speech about politics, about order, brotherhood, power ... but speeches are for campaigning. Now is the time for action.",Morty +9,"Having a family doesn't mean that you stop being an individual. You know the best thing you can do for the people that depend on you? Be honest with them, even if it means setting them free.",Mr. 
Meeseeks +10,"If I've learned one thing, it's that before you get anywhere in life, you gotta stop listening to yourself.",Jerry +11,"I just want to go back to Hell, where everyone thinks I'm smart and funny.",Mr. Needful +12,"Hi Mr. Jellybean, I'm Morty. I’m on an adventure with my grandpa.",Morty +13,"You're not the cause of your parents' misery. You're just a symptom of it.",Summer +14,"Don't deify the people who leave you.",Beth +15,"Well, then get your s—t together, get it all together, and put it in a backpack, all your s—t, so it's together. And if you gotta take it somewhere, take it somewhere, you know, take it to the s—t store and sell it, or put it in the s—t museum. I don't care what you do, you just gotta get it together. Get your s—t together.",Morty +16,"At least the devil has a job!",Summer +17,"Life is effort and I'll stop when I die!",Jerry +18,"I just killed my family! I don't care what they were!",Morty +19,"It's funny to say they are small. It's funny to say they are big.",Shrimply Pibbles +20,"You're holding me verbally hostage.",Summer +21,"Honey, stop raising your father's cholesterol so you can take a hot funeral selfie.",Beth +22,"Rick, when you say you made an exact replica of the house, did you mean, like, an exact replica?",Morty +23,"Give a gun to the lady who got pregnant with me too early and constantly makes it our problem.",Summer +24,"Say goodbye to your precious dry land! For soon it will be wet!",Mr. Nimbus +25,"Nobody's smarter than Rick, but nobody else is my dad. You're a genius at that.",Morty +26,"B—h, my generation gets traumatized for breakfast.",Summer +27,"Inception made sense!",Morty +28,"I realize now I'm attracted to you for the same reason I can't be with you: You can't change. And I have no problem with that, but it clearly means I have a problem with myself.",Unity +29,"Mr. 
President, if I've learned one thing today, it's that sometimes you have to not give a f—k!",Morty +30,"I didn't know freedom meant people doing stuff that sucks.",Summer +31,"How many of these are just horrible mistakes I made? I mean, maybe I'd stop making so many if I let myself learn from them.",Morty +32,"I'm a scientist because I invent, transform, create, and destroy for a living. And when I don't like something about the world, I change it.",Rick +33,"Wubba lubba dub dub!",Rick +34,"I turned myself into a pickle, Morty! I'm Pickle Rick!",Rick +35,"I know about the Yosemite T-shirt, Morty.",Rick +36,"The universe is basically an animal. It grazes on the ordinary. It creates infinite idiots just to eat them.",Rick +37,"If I die in a cage, I lose a bet.",Rick +38,"Sometimes science is more art than science.",Rick +39,"To live is to risk it all—otherwise, you're just an inert chunk of randomly assembled molecules drifting wherever the universe blows you.",Rick +40,"Welcome to the club, pal.",Rick +41,"So I have an emo streak. It's part of what makes me so rad.",Rick +42,"Listen, I'm not the nicest guy in the universe, because I'm the smartest, and being nice is something stupid people do to hedge their bets.",Rick +43,"Wait a minute! Is that Mountain Dew in my quantum-transport-solution?",Rick +44,"Listen, Morty, I hate to break it to you, but what people call 'love' is just a chemical reaction that compels animals to breed.",Rick +45,"Break the cycle, Morty. Rise above. Focus on science.",Rick +46,"Don't get drawn into the culture, Morty. Stealing stuff is about the stuff, not the stealing.",Rick +47,"I'm sorry, but your opinion means very little to me.",Rick +48,"You don't get to tell anyone what's sad. You’re like a one-man Mount Sadmore. So I guess like a Lincoln Sadmorial.",Rick +49,"This pickle doesn't care about your children. I'm not gonna take their dreams. 
I'm gonna take their parents.",Rick +50,"I programmed you to believe that.",Rick +51,"Have fun with empowerment. It seems to make everyone that gets it really happy.",Rick +52,"Thanks, Mr. Poopybutthole. I always could count on you.",Rick +53,"Weddings are basically funerals with a cake.",Rick +54,"I mean, if you spend all day shuffling words around, you can make anything sound bad, Morty.",Rick +55,"It's your choice to take this personally.",Rick +56,"Excuse me, coming through. What are you here for? Just kidding, I don't care.",Rick +57,"If I let you make me nervous, then we can't get schwifty.",Rick +58,"Oh, boy, so you actually learned something today? What is this, Full House?",Rick +59,"I can't abide bureaucracy. I don't like being told where to go and what to do. I consider it a violation. Did you get those seeds all the way up your butt?",Rick +60,"I think you have to think ahead and live in the moment.",Rick +61,"I know that new situations can be intimidating. You're lookin' around and it's all scary and different, but you know, meeting them head-on, charging into 'em like a bull—that's how we grow as people.",Rick