CLIP CLAP: Multi-modal emotion classification
Now you'll perform emotion analysis of the advertisement you prepared previously, this time using CLIP and CLAP. To make a multi-modal emotion classification, you will combine the predictions of these two models by averaging them (a technique known as late fusion).
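As a quick illustration of late fusion with made-up numbers (the scores below are hypothetical, not outputs of CLIP or CLAP), averaging the per-emotion scores from the two modalities looks like this:
# Hypothetical per-emotion scores from the image and audio models
image_scores = {'joy': 0.7, 'sadness': 0.3}
audio_scores = {'joy': 0.5, 'sadness': 0.5}
# Late fusion: take the mean of the two modalities for each emotion
fused = {e: (image_scores[e] + audio_scores[e]) / 2 for e in image_scores}
print(fused)  # {'joy': 0.6, 'sadness': 0.4}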
The video (video) and the corresponding audio (audio_sample) you created previously are still available. A list of emotions has been loaded as emotions.
This exercise is part of the course Multi-Modal Models with Hugging Face.
Exercise instructions
- Make an audio classifier pipeline for zero-shot-audio-classification using the laion/clap-htsat-unfused model.
- Make an image classifier pipeline for zero-shot-image-classification using the openai/clip-vit-large-patch14 model.
- Use the image classifier pipeline to generate predictions for each image in the video.
- Use the audio classifier pipeline to generate predictions for the audio_sample.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Make an audio classifier pipeline
audio_classifier = ____(model=____, task=____)
# Make an image classifier pipeline
image_classifier = ____(model=____, task=____)
scores = []
for img in video:
    # Make image scores
    predictions = ____(____, candidate_labels=____)
    scores.append({l['label']: l['score'] for l in predictions})

av_scores = {emotion: sum([s[emotion] for s in scores])/len(scores) for emotion in emotions}
# Make audio scores
audio_scores = ____(____, candidate_labels=____)
audio_scores = {l['label']: l['score'] for l in audio_scores}
multimodal_scores = {emotion: (av_scores[emotion] + audio_scores[emotion])/2 for emotion in emotions}
print(f"Multimodal scores: {multimodal_scores}")