diff --git a/docs/src/examples/nodejs.md b/docs/src/examples/nodejs.md
index ec9a5131..d85c29eb 100644
--- a/docs/src/examples/nodejs.md
+++ b/docs/src/examples/nodejs.md
@@ -4,96 +4,136 @@ nodejs
 
-This Q&A bot will allow you to search through youtube transcripts using natural language! We'll introduce how you can use LanceDB's Javascript API to store and manage your data easily.
+This Q&A bot will allow you to search through YouTube transcripts using natural language! We'll introduce how to use LanceDB's JavaScript API to store and manage your data easily.
 
-For this example we're using a HuggingFace dataset that contains YouTube transcriptions: `jamescalam/youtube-transcriptions`, to make it easier, we've converted it to a LanceDB `db` already, which you can download and put in a working directory:
-
-```wget -c https://eto-public.s3.us-west-2.amazonaws.com/lancedb_demo.tar.gz -O - | tar -xz -C .```
-
-Now, we'll create a simple app that can:
-1. Take a text based query and search for contexts in our corpus, using embeddings generated from the OpenAI Embedding API.
-2. Create a prompt with the contexts, and call the OpenAI Completion API to answer the text based query.
-
-Dependencies and setup of OpenAI API:
-
-```javascript
-const lancedb = require("vectordb");
-const { Configuration, OpenAIApi } = require("openai");
-
-const configuration = new Configuration({
-  apiKey: process.env.OPENAI_API_KEY,
-  });
-const openai = new OpenAIApi(configuration);
+```bash
+npm install vectordb openai
 ```
 
-First, let's set our question and the context amount. The context amount will be used to query similar documents in our corpus.
-
-```javascript
-const QUESTION = "who was the 12th person on the moon and when did they land?";
-const CONTEXT_AMOUNT = 3;
-```
-
-Now, let's generate an embedding from this question:
-
-```javascript
-const embeddingResponse = await openai.createEmbedding({
-  model: "text-embedding-ada-002",
-  input: QUESTION,
-});
-
-const embedding = embeddingResponse.data["data"][0]["embedding"];
-```
-
-Once we have the embedding, we can connect to LanceDB (using the database we downloaded earlier), and search through the chatbot table.
-We'll extract 3 similar documents found.
+## Download the data
+
+For this example, we're using a sample of a HuggingFace dataset that contains YouTube transcriptions: `jamescalam/youtube-transcriptions`. Download this file and place it in the `data` folder:
+
+```bash
+wget -c https://eto-public.s3.us-west-2.amazonaws.com/datasets/youtube_transcript/youtube-transcriptions_sample.jsonl
+```
+
+## Prepare Context
+
+Each item in the dataset contains just a short chunk of text. We'll need to merge several of these chunks together on a rolling basis. For this demo, we'll look back 20 records to create a more complete context for each sentence.
+
+First, we need to read and parse the input file.
+
+```javascript
+const lines = (await fs.readFile(INPUT_FILE_NAME, 'utf-8'))
+  .toString()
+  .split('\n')
+  .filter(line => line.length > 0)
+  .map(line => JSON.parse(line))
+
+const data = contextualize(lines, 20, 'video_id')
+```
+
+The `contextualize` function groups the transcripts by `video_id` and then creates the expanded context for each item.
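+
+Before looking at the implementation, here is the behavior we want from `contextualize`, sketched with made-up rows and a context size of 2 (the demo uses 20):
+
+```javascript
+// Hypothetical rows from a single video, for illustration only
+const rows = [
+  { video_id: 'a', text: 'Hello' },
+  { video_id: 'a', text: 'brave' },
+  { video_id: 'a', text: 'new world' }
+]
+
+console.log(contextualize(rows, 2, 'video_id').map(r => r.context))
+// => [ 'Hello', 'Hello brave', 'Hello brave new world' ]
+```
+
+The implementation follows: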
 ```javascript
-const db = await lancedb.connect('./lancedb');
-const tbl = await db.openTable('chatbot');
-const query = tbl.search(embedding);
-query.limit = CONTEXT_AMOUNT;
-const context = await query.execute();
-```
-
-Let's combine the context together so we can pass it into our prompt:
-
-```javascript
-for (let i = 1; i < context.length; i++) {
-  context[0]["text"] += " " + context[i]["text"];
+function contextualize (rows, contextSize, groupColumn) {
+  const grouped = {}
+  rows.forEach(row => {
+    if (!grouped[row[groupColumn]]) {
+      grouped[row[groupColumn]] = []
+    }
+    grouped[row[groupColumn]].push(row)
+  })
+
+  const data = []
+  Object.keys(grouped).forEach(key => {
+    for (let i = 0; i < grouped[key].length; i++) {
+      const start = i - contextSize > 0 ? i - contextSize : 0
+      grouped[key][i].context = grouped[key].slice(start, i + 1).map(r => r.text).join(' ')
+    }
+    data.push(...grouped[key])
+  })
+  return data
 }
 ```
 
-Lastly, let's construct the prompt. You could play around with this to create more accurate/better prompts to yield results.
+## Create the LanceDB Table
+
+To load our data into LanceDB, we need to create embeddings (vectors) for each item. For this example, we will use the OpenAI embedding function, which has a native integration with LanceDB.
 
 ```javascript
-const prompt = "Answer the question based on the context below.\n\n" +
-    "Context:\n" +
-    `${context[0]["text"]}\n` +
-    `\n\nQuestion: ${QUESTION}\nAnswer:`;
+// You need to provide an OpenAI API key; here we read it from the OPENAI_API_KEY environment variable
+const apiKey = process.env.OPENAI_API_KEY
+// The embedding function will create embeddings for the 'context' column
+const embedFunction = new lancedb.OpenAIEmbeddingFunction('context', apiKey)
+// Connect to LanceDB
+const db = await lancedb.connect('data/youtube-lancedb')
+const tbl = await db.createTable('vectors', data, embedFunction)
 ```
 
-We pass the prompt, along with the context, to the completion API.
+## Create and answer the prompt
+
+We will accept questions in natural language and use our corpus stored in LanceDB to answer them. First, we need to set up the OpenAI client:
 
 ```javascript
-const completion = await openai.createCompletion({
-  model: "text-davinci-003",
-  prompt,
-  temperature: 0,
-  max_tokens: 400,
-  top_p: 1,
-  frequency_penalty: 0,
-  presence_penalty: 0,
-});
+const configuration = new Configuration({ apiKey })
+const openai = new OpenAIApi(configuration)
 ```
 
-And that's it!
+Then we can read questions from the terminal and use LanceDB to retrieve the three most relevant transcripts for each one:
 
 ```javascript
-console.log(completion.data.choices[0].text);
+const rl = readline.createInterface({ input, output })
+const query = await rl.question('Prompt: ')
+const results = await tbl
+  .search(query)
+  .select(['title', 'text', 'context'])
+  .limit(3)
+  .execute()
 ```
 
-The response is (which is non deterministic):
+The query and the transcripts' context are appended together in a single prompt:
 
+```javascript
+function createPrompt (query, context) {
+  let prompt =
+    'Answer the question based on the context below.\n\n' +
+    'Context:\n'
+
+  // need to make sure our prompt is not larger than max size
+  prompt = prompt + context.map(c => c.context).join('\n\n---\n\n').substring(0, 3750)
+  prompt = prompt + `\n\nQuestion: ${query}\nAnswer:`
+  return prompt
+}
 ```
-The 12th person on the moon was Harrison Schmitt and he landed on December 11, 1972.
+
+The `substring(0, 3750)` call is a crude guard on prompt size: text-davinci-003 shares a roughly 4,097-token window between prompt and completion, and we reserve 400 tokens for the answer below, so the context is clipped to a size that comfortably fits (characters are only a rough proxy for tokens).
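+
+As a quick sanity check, here is what the assembled prompt looks like for a hypothetical one-element result set (illustrative values only):
+
+```javascript
+// Hypothetical search results, for illustration only
+const toyResults = [{ context: 'Alan Shepard walked on the moon in 1971.' }]
+console.log(createPrompt('who was the fifth person on the moon?', toyResults))
+// Answer the question based on the context below.
+//
+// Context:
+// Alan Shepard walked on the moon in 1971.
+//
+// Question: who was the fifth person on the moon?
+// Answer:
+```
+
+We can now use the OpenAI Completion API to process our custom prompt and give us an answer.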
+
+```javascript
+const response = await openai.createCompletion({
+  model: 'text-davinci-003',
+  prompt: createPrompt(query, results),
+  max_tokens: 400,
+  temperature: 0,
+  top_p: 1,
+  frequency_penalty: 0,
+  presence_penalty: 0
+})
+console.log(response.data.choices[0].text)
+```
+
+## Let's put it all together now
+
+Now we can provide queries and have them answered based on our local LanceDB data.
+
+```bash
+Prompt: who was the 12th person on the moon and when did they land?
+ The 12th person on the moon was Harrison Schmitt and he landed on December 11, 1972.
+Prompt: Which training method should I use for sentence transformers when I only have pairs of related sentences?
+ NLI with multiple negative ranking loss.
+```
+
+## That's a wrap
+
+In this example, you learned how to use LanceDB to store and query embedding representations of your local data. The complete example code is on [GitHub](https://github.com/lancedb/lancedb/tree/main/node/examples), and you can also download the LanceDB dataset using [this link](https://eto-public.s3.us-west-2.amazonaws.com/datasets/youtube_transcript/youtube-lancedb.zip).
+
diff --git a/docs/src/index.md b/docs/src/index.md
index 3ff672ae..69fa2ff2 100644
--- a/docs/src/index.md
+++ b/docs/src/index.md
@@ -55,6 +55,9 @@ LanceDB's core is written in Rust 🦀 and is built using
diff --git a/node/examples/js-youtube-transcripts/index.js b/node/examples/js-youtube-transcripts/index.js
new file mode 100644
--- /dev/null
+++ b/node/examples/js-youtube-transcripts/index.js
+const lancedb = require('vectordb')
+const { Configuration, OpenAIApi } = require('openai')
+const fs = require('fs/promises')
+const readline = require('readline/promises')
+const { stdin: input, stdout: output } = require('process')
+
+// Path of the downloaded sample dataset (adjust if you saved it elsewhere)
+const INPUT_FILE_NAME = 'data/youtube-transcriptions_sample.jsonl'
+
+;(async () => {
+  // You need to provide an OpenAI API key; here we read it from the OPENAI_API_KEY environment variable
+  const apiKey = process.env.OPENAI_API_KEY
+  // The embedding function will create embeddings for the 'context' column
+  const embedFunction = new lancedb.OpenAIEmbeddingFunction('context', apiKey)
+
+  // Connect to LanceDB
+  const db = await lancedb.connect('data/youtube-lancedb')
+
+  // Open the vectors table or create one if it does not exist
+  let tbl
+  if ((await db.tableNames()).includes('vectors')) {
+    tbl = await db.openTable('vectors', embedFunction)
+  } else {
+    tbl = await createEmbeddingsTable(db, embedFunction)
+  }
+
+  // Use the OpenAI Completion API to generate an answer based on the context that LanceDB provides
+  const configuration = new Configuration({ apiKey })
+  const openai = new OpenAIApi(configuration)
+  const rl = readline.createInterface({ input, output })
+  try {
+    while (true) {
+      const query = await rl.question('Prompt: ')
+      const results = await tbl
+        .search(query)
+        .select(['title', 'text', 'context'])
+        .limit(3)
+        .execute()
+
+      // console.table(results)
+
+      const response = await openai.createCompletion({
+        model: 'text-davinci-003',
+        prompt: createPrompt(query, results),
+        max_tokens: 400,
+        temperature: 0,
+        top_p: 1,
+        frequency_penalty: 0,
+        presence_penalty: 0
+      })
+      console.log(response.data.choices[0].text)
+    }
+  } catch (err) {
+    console.log('Error: ', err)
+  } finally {
+    rl.close()
+  }
+  process.exit(1)
+})()
+
+async function createEmbeddingsTable (db, embedFunction) {
+  console.log(`Creating embeddings from ${INPUT_FILE_NAME}`)
+  // read the input file into a JSON array, skipping empty lines
+  const lines = (await fs.readFile(INPUT_FILE_NAME, 'utf-8'))
+    .toString()
+    .split('\n')
+    .filter(line => line.length > 0)
+    .map(line => JSON.parse(line))
+
+  const data = contextualize(lines, 20, 'video_id')
+  return await db.createTable('vectors', data, embedFunction)
+}
+
+// Each transcript has a small text column; we include previous transcripts in order to
+// have more context information when creating embeddings
+function contextualize (rows, contextSize, groupColumn) {
+  const grouped = {}
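+  // Bucket the rows by the grouping column (video_id here) so the rolling
+  // window never mixes transcripts from different videos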
+  rows.forEach(row => {
+    if (!grouped[row[groupColumn]]) {
+      grouped[row[groupColumn]] = []
+    }
+    grouped[row[groupColumn]].push(row)
+  })
+
+  // Walk each group and attach a rolling-window context to every row
+  const data = []
+  Object.keys(grouped).forEach(key => {
+    for (let i = 0; i < grouped[key].length; i++) {
+      const start = i - contextSize > 0 ? i - contextSize : 0
+      grouped[key][i].context = grouped[key].slice(start, i + 1).map(r => r.text).join(' ')
+    }
+    data.push(...grouped[key])
+  })
+  return data
+}
+
+// Creates a prompt by aggregating all relevant contexts
+function createPrompt (query, context) {
+  let prompt =
+    'Answer the question based on the context below.\n\n' +
+    'Context:\n'
+
+  // need to make sure our prompt is not larger than max size
+  prompt = prompt + context.map(c => c.context).join('\n\n---\n\n').substring(0, 3750)
+  prompt = prompt + `\n\nQuestion: ${query}\nAnswer:`
+  return prompt
+}
diff --git a/node/examples/js-youtube-transcripts/package.json b/node/examples/js-youtube-transcripts/package.json
new file mode 100644
index 00000000..12f44bb5
--- /dev/null
+++ b/node/examples/js-youtube-transcripts/package.json
@@ -0,0 +1,15 @@
+{
+  "name": "vectordb-example-js-openai",
+  "version": "1.0.0",
+  "description": "",
+  "main": "index.js",
+  "scripts": {
+    "test": "echo \"Error: no test specified\" && exit 1"
+  },
+  "author": "Lance Devs",
+  "license": "Apache-2.0",
+  "dependencies": {
+    "vectordb": "file:../..",
+    "openai": "^3.2.1"
+  }
+}