Video sentiment analysis with CLIP and CLAP
Now you'll perform emotion analysis of the advertisement you prepared previously, using CLIP and CLAP. To make a multi-modal classification of emotion, you'll combine the predictions of the two models by averaging them, a technique known as late fusion.
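Late fusion here simply means averaging the per-emotion scores that each modality produces. As a toy illustration (the labels and numbers below are made up, not real model output):

# Toy example of late fusion: average per-label scores from two modalities
image_scores = {"happy": 0.7, "sad": 0.3}   # made-up scores from an image model
audio_scores = {"happy": 0.5, "sad": 0.5}   # made-up scores from an audio model
fused = {label: (image_scores[label] + audio_scores[label]) / 2 for label in image_scores}
print(fused)  # {'happy': 0.6, 'sad': 0.4}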
The video (video) and the corresponding audio (audio_sample) you created previously are still available.

A list of emotions has been loaded as emotions.
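If you want to try the snippets outside the exercise environment, you could stand in for these preloaded objects roughly as follows (the shapes and values are illustrative assumptions, not the course's actual data):

import numpy as np
from PIL import Image

# Hypothetical stand-ins for the preloaded objects (illustrative only)
video = [Image.new("RGB", (224, 224), color="gray") for _ in range(8)]  # frames as a list of PIL images
audio_sample = np.zeros(48_000, dtype=np.float32)                       # one second of silent mono audio at 48 kHz
emotions = ["happy", "sad", "angry", "calm"]                            # example candidate labels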
Exercise instructions
- Make an audio classifier pipeline for zero-shot-audio-classification using the laion/clap-htsat-unfused model (one possible construction of both pipelines is sketched after this list).
- Make an image classifier pipeline for zero-shot-image-classification using the openai/clip-vit-base-patch32 model (a smaller variant of what we used in the video).
- Use the image classifier pipeline to generate predictions for each image in the video.
- Use the audio classifier pipeline to generate predictions for the audio_sample.
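One plausible way to build the two pipelines named above (the task names and checkpoints come straight from the instructions; treat this as a sketch rather than the official solution):

from transformers import pipeline

# Zero-shot audio classifier backed by the CLAP checkpoint
audio_classifier = pipeline(
    task="zero-shot-audio-classification",
    model="laion/clap-htsat-unfused"
)

# Zero-shot image classifier backed by the CLIP checkpoint
image_classifier = pipeline(
    task="zero-shot-image-classification",
    model="openai/clip-vit-base-patch32"
)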
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Make an audio classifier pipeline
audio_classifier = ____(model="____", task="____")
# Make an image classifier pipeline
image_classifier = ____(model="____", task="____")
# Create emotion scores for each video frame
predictions = image_classifier(video, candidate_labels=emotions)
scores = [
    {l['label']: l['score'] for l in prediction}
    for prediction in predictions
]
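# Average the image scores across all frames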
avg_image_scores = {
    emotion: sum(s[emotion] for s in scores) / len(scores)
    for emotion in emotions
}
# Make audio scores
audio_scores = ____(____, candidate_labels=____)
audio_scores = {l['label']: l['score'] for l in audio_scores}
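# Late fusion: average the image and audio scores for each emotion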
multimodal_scores = {
    emotion: (avg_image_scores[emotion] + audio_scores[emotion]) / 2
    for emotion in emotions
}
print(f"Multimodal scores: {multimodal_scores}")