

Exercise

Video sentiment analysis with CLIP and CLAP

Now you'll perform emotion analysis on the advertisement you prepared earlier, using CLIP and CLAP. To make a multi-modal classification of emotion, you'll combine the two models' predictions by taking their mean, an approach known as late fusion.
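As a quick illustration of what late fusion means here, the sketch below averages two made-up probability vectors over three hypothetical emotion labels; the label order and the numbers are assumptions for demonstration only, not real model outputs:

```python
import numpy as np

# Hypothetical per-label probabilities from the image model (CLIP) and the
# audio model (CLAP), ordered as ["happy", "sad", "angry"] -- illustrative values
image_probs = np.array([0.6, 0.3, 0.1])
audio_probs = np.array([0.4, 0.4, 0.2])

# Late fusion: average the two models' probabilities label by label
fused_probs = (image_probs + audio_probs) / 2
print(fused_probs)  # [0.5  0.35 0.15] -> "happy" is the fused prediction
```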

The video (video) and the corresponding audio (audio_sample) you created previously are still available:

[Figure: frames from the Bounce TV commercial]

A list of emotions has been loaded as emotions.

Instructions

  • Create an audio classifier pipeline for zero-shot-audio-classification using the laion/clap-htsat-unfused model.
  • Create an image classifier pipeline for zero-shot-image-classification using the openai/clip-vit-base-patch32 model (a smaller variant of what we used in the video).
  • Use the image classifier pipeline to generate predictions for each frame of the video.
  • Use the audio classifier pipeline to generate predictions for audio_sample (see the sketch after this list).
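One possible sketch of these steps is shown below. It assumes video is a list of image frames and audio_sample is a raw waveform in a format the pipeline accepts, as prepared in the earlier exercises; the output variable names (image_predictions, audio_predictions) are chosen here for illustration:

```python
from transformers import pipeline

# Zero-shot audio classifier built on CLAP
audio_classifier = pipeline(
    task="zero-shot-audio-classification",
    model="laion/clap-htsat-unfused"
)

# Zero-shot image classifier built on a smaller CLIP variant
image_classifier = pipeline(
    task="zero-shot-image-classification",
    model="openai/clip-vit-base-patch32"
)

# Score every frame of the video against the candidate emotion labels
image_predictions = [
    image_classifier(frame, candidate_labels=emotions) for frame in video
]

# Score the audio track against the same emotion labels
audio_predictions = audio_classifier(audio_sample, candidate_labels=emotions)
```

Each classifier call returns a list of dictionaries with label and score keys, so the frame-level and audio-level scores can then be averaged per label to produce the late-fusion prediction described above.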