Vector Databases for Embeddings with Pinecone

Exercise

Upserting YouTube transcripts

In the following exercises, you'll create a chatbot that can answer questions about YouTube videos by ingesting video transcripts and additional metadata into your 'pinecone-datacamp' index.

To start, you'll prepare data from the youtube_rag_data.csv file and upsert the vectors with all of their metadata into the 'pinecone-datacamp' index. The data is provided in the DataFrame youtube_df.

Here's an example transcript from the youtube_df DataFrame:

id: 35Pdoyi6ZoQ-t0.0
title: Training and Testing an Italian BERT - Transformers From Scratch #4
text: Hi, welcome to the video. So this is the fourth video in a Transformers from Scratch mini series. So if you haven't been following along, we've essentially covered what you can see on the screen. So we got some data. We built a tokenizer with it...
url: https://youtu.be/35Pdoyi6ZoQ
published: 01-01-2024

Instructions

  • Initialize the Pinecone client with your API key (the OpenAI client is available as client).
  • Extract the 'id', 'text', 'title', 'url', and 'published' metadata from each row.
  • Encode texts using 'text-embedding-3-small' from OpenAI.
  • Upsert the vectors and their metadata to a namespace called 'youtube_rag_dataset'.
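
The steps above can be sketched roughly as follows. This is one possible approach rather than the course's reference solution. The helper names `batched` and `upsert_transcripts`, the batch size of 100, the reuse of each row's 'id' as the vector ID, and the choice to pass rows as a list of dicts are all assumptions made for illustration. The function takes the OpenAI client and the Pinecone index as arguments, so it makes no network calls until real clients are passed in.

```python
from itertools import islice

# Columns copied into each vector's metadata (taken from the instructions).
METADATA_FIELDS = ("id", "text", "title", "url", "published")


def batched(items, size):
    """Yield successive lists of at most `size` items (hypothetical helper)."""
    iterator = iter(items)
    while batch := list(islice(iterator, size)):
        yield batch


def upsert_transcripts(rows, openai_client, index,
                       namespace="youtube_rag_dataset",
                       model="text-embedding-3-small",
                       batch_size=100):
    """Embed each row's text and upsert it with its metadata.

    `rows` is an iterable of dicts, e.g. youtube_df.to_dict("records").
    Returns the number of vectors sent to the index.
    """
    total = 0
    for batch in batched(rows, batch_size):
        # Assumption: the row's own 'id' doubles as the vector ID.
        ids = [row["id"] for row in batch]
        texts = [row["text"] for row in batch]
        metadatas = [{field: row[field] for field in METADATA_FIELDS}
                     for row in batch]

        # One embeddings request per batch.
        # Assumption: results come back in the same order as the inputs.
        response = openai_client.embeddings.create(input=texts, model=model)
        embeddings = [item.embedding for item in response.data]

        # Each vector is an (id, values, metadata) tuple.
        index.upsert(vectors=list(zip(ids, embeddings, metadatas)),
                     namespace=namespace)
        total += len(batch)
    return total


# Example wiring with the real clients (not executed here; needs API keys):
#   from pinecone import Pinecone
#   pc = Pinecone(api_key="YOUR_API_KEY")
#   index = pc.Index("pinecone-datacamp")
#   upsert_transcripts(youtube_df.to_dict("records"), client, index)
```

Batching keeps each embeddings request and each upsert call to a bounded size. The value of 100 is a guess and may need adjusting to the limits of your plan.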