Upserting YouTube transcripts

In the following exercises, you'll create a chatbot that can answer questions about YouTube videos by ingesting video transcripts and additional metadata into your 'pinecone-datacamp' index.

To start, you'll prepare data from the youtube_rag_data.csv file and upsert the vectors with all of their metadata into the 'pinecone-datacamp' index. The data is provided in the DataFrame youtube_df.

Here's an example transcript from the youtube_df DataFrame:

id: 35Pdoyi6ZoQ-t0.0

title: Training and Testing an Italian BERT - Transformers From Scratch #4

text:
Hi, welcome to the video. So this is the fourth video in a Transformers from Scratch
mini series. So if you haven't been following along, we've essentially covered what
you can see on the screen. So we got some data. We built a tokenizer with it...

url: https://youtu.be/35Pdoyi6ZoQ

published: 01-01-2024

Initialize the Pinecone client with your API key (the OpenAI client is available as client).
Extract the 'id', 'text', 'title', 'url', and 'published' metadata from each row.
Encode texts using 'text-embedding-3-small' from OpenAI.
Upsert the vectors and their metadata to a namespace called 'youtube_rag_dataset'.
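
The steps above can be sketched roughly as follows. This is a minimal sketch, not the exercise's reference solution: the helper names (batched, build_vectors, upsert_transcripts), the batch size of 100, and the choice to pass rows as a list of dicts are all assumptions made for illustration. The OpenAI client and the Pinecone index are passed in as arguments so the logic can be read on its own.

```python
from itertools import islice


def batched(rows, batch_size):
    """Yield successive lists of at most batch_size rows (hypothetical helper)."""
    iterator = iter(rows)
    while batch := list(islice(iterator, batch_size)):
        yield batch


def build_vectors(batch, embeddings):
    """Pair each row with its embedding, in the dict format accepted by upsert."""
    # Assumption: 'id' becomes the vector ID; the other four fields become metadata.
    metadata_fields = ("text", "title", "url", "published")
    return [
        {
            "id": row["id"],
            "values": embedding,
            "metadata": {field: row[field] for field in metadata_fields},
        }
        for row, embedding in zip(batch, embeddings)
    ]


def upsert_transcripts(rows, openai_client, index, batch_size=100,
                       namespace="youtube_rag_dataset"):
    """Embed each batch of transcripts and upsert it; return the row count."""
    total = 0
    for batch in batched(rows, batch_size):
        texts = [row["text"] for row in batch]
        # Encode the whole batch of texts in one embeddings request.
        response = openai_client.embeddings.create(
            input=texts, model="text-embedding-3-small"
        )
        embeddings = [item.embedding for item in response.data]
        # Write the vectors and their metadata to the target namespace.
        index.upsert(vectors=build_vectors(batch, embeddings), namespace=namespace)
        total += len(batch)
    return total
```

In the exercise environment, this would presumably be called with rows taken from youtube_df (for example youtube_df.to_dict("records")), the provided client, and an index obtained from a Pinecone client initialized with your API key, such as Pinecone(api_key="...").Index('pinecone-datacamp'). Batching keeps each embedding request and each upsert call to a manageable size.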
