

Exercise

Video sentiment analysis with CLIP and CLAP

Now you'll perform emotion analysis on the advertisement you prepared earlier, using CLIP and CLAP. To make a multi-modal classification of emotion, you'll combine the two models' predictions by taking their mean, an approach known as late fusion.
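As a quick illustration of what late fusion means here, the sketch below averages two made-up probability vectors over three hypothetical emotion labels; the label order and the numbers are assumptions for demonstration only, not real model outputs:

```python
import numpy as np

# Hypothetical per-label probabilities from the image model (CLIP) and the
# audio model (CLAP), ordered as ["happy", "sad", "angry"] -- illustrative values
image_probs = np.array([0.6, 0.3, 0.1])
audio_probs = np.array([0.4, 0.4, 0.2])

# Late fusion: average the two models' probabilities label by label
fused_probs = (image_probs + audio_probs) / 2
print(fused_probs)  # [0.5  0.35 0.15] -> "happy" is the fused prediction
```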

The video (video) and the corresponding audio (audio_sample) you created previously are still available:

[Figure: frames from the Bounce TV commercial]

A list of emotions has been loaded as emotions.

Instructions

  • Create an audio classifier pipeline for zero-shot-audio-classification using the laion/clap-htsat-unfused model.
  • Create an image classifier pipeline for zero-shot-image-classification using the openai/clip-vit-base-patch32 model (a smaller variant of what we used in the video).
  • Use the image classifier pipeline to generate predictions for each frame of the video.
  • Use the audio classifier pipeline to generate predictions for audio_sample (see the sketch after this list).
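One possible sketch of these steps is shown below. It assumes video is a list of image frames and audio_sample is a raw waveform in a format the pipeline accepts, as prepared in the earlier exercises; the output variable names (image_predictions, audio_predictions) are chosen here for illustration:

```python
from transformers import pipeline

# Zero-shot audio classifier built on CLAP
audio_classifier = pipeline(
    task="zero-shot-audio-classification",
    model="laion/clap-htsat-unfused"
)

# Zero-shot image classifier built on a smaller CLIP variant
image_classifier = pipeline(
    task="zero-shot-image-classification",
    model="openai/clip-vit-base-patch32"
)

# Score every frame of the video against the candidate emotion labels
image_predictions = [
    image_classifier(frame, candidate_labels=emotions) for frame in video
]

# Score the audio track against the same emotion labels
audio_predictions = audio_classifier(audio_sample, candidate_labels=emotions)
```

Each classifier call returns a list of dictionaries with label and score keys, so the frame-level and audio-level scores can then be averaged per label to produce the late-fusion prediction described above.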