Video sentiment analysis with CLIP and CLAP
Now you'll perform emotion analysis of the advertisement you prepared earlier, using CLIP and CLAP. To make a multi-modal classification of emotion, you will combine the predictions of these models by taking their mean (an approach known as late fusion).

The video (video) and the corresponding audio (audio_sample) you created previously are still available. A list of emotions has been loaded as emotions.
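To make late fusion concrete, here is a minimal sketch (with made-up, purely illustrative scores and hypothetical variable names ending in _demo) of averaging per-emotion probabilities from two modalities:

# Hypothetical per-emotion scores from two modalities (illustrative numbers only)
image_scores_demo = {"happy": 0.70, "sad": 0.10, "angry": 0.20}
audio_scores_demo = {"happy": 0.50, "sad": 0.30, "angry": 0.20}

# Late fusion: take the mean of the two modality scores for each label
fused_demo = {
    label: (image_scores_demo[label] + audio_scores_demo[label]) / 2
    for label in image_scores_demo
}
print(fused_demo)  # {'happy': 0.6, 'sad': 0.2, 'angry': 0.2}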
This exercise is part of the course Multi-Modal Models with Hugging Face.
Exercise instructions
- Make an audio classifier pipeline for zero-shot-audio-classification using the laion/clap-htsat-unfused model.
- Make an image classifier pipeline for zero-shot-image-classification using the openai/clip-vit-base-patch32 model (a smaller variant of what we used in the video); a construction sketch follows this list.
- Use the image classifier pipeline to generate predictions for each image in the video.
- Use the audio classifier pipeline to generate predictions for the audio_sample.
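For reference, a sketch of how the two pipelines named above can be constructed with the Hugging Face pipeline function; the task and model names come from the instructions, and the commented output below the calls is only indicative of the shape a zero-shot pipeline returns:

from transformers import pipeline

# Zero-shot audio classifier built on CLAP
audio_classifier = pipeline(
    task="zero-shot-audio-classification",
    model="laion/clap-htsat-unfused",
)

# Zero-shot image classifier built on CLIP
image_classifier = pipeline(
    task="zero-shot-image-classification",
    model="openai/clip-vit-base-patch32",
)

# Each call returns a list of {'label': ..., 'score': ...} dicts per input,
# one entry per candidate label, for example:
# [{'label': 'happy', 'score': 0.71}, {'label': 'sad', 'score': 0.29}]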
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Make an audio classifier pipeline
audio_classifier = ____(model="____", task="____")
# Make an image classifier pipeline
image_classifier = ____(model="____", task="____")
# Create emotion scores for each video frame
predictions = image_classifier(video, candidate_labels=emotions)
scores = [
{l['label']: l['score'] for l in prediction}
for prediction in predictions
]
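# Average each emotion's score across all video frames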
avg_image_scores = {
    emotion: sum(s[emotion] for s in scores) / len(scores)
    for emotion in emotions
}
# Make audio scores
audio_scores = ____(____, candidate_labels=____)
audio_scores = {l['label']: l['score'] for l in audio_scores}
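# Late fusion: average the image-based and audio-based scores per emotion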
multimodal_scores = {
    emotion: (avg_image_scores[emotion] + audio_scores[emotion]) / 2
    for emotion in emotions
}
print(f"Multimodal scores: {multimodal_scores}")