
CLIP CLAP: Multi-modal emotion classification

Now you'll perform emotion analysis of the advertisement you prepared previously, using CLIP and CLAP. To make a multi-modal emotion classification, you'll combine the two models' predictions by taking their mean, a technique known as late fusion.
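
For illustration, late fusion by mean simply averages the per-label scores from the two models. Here is a minimal sketch with invented labels and scores (not the exercise data):

# Average the per-label scores from two models (late fusion by mean);
# the labels and score values here are invented purely for illustration
image_preds = {"joy": 0.7, "sadness": 0.3}
audio_preds = {"joy": 0.5, "sadness": 0.5}
fused = {label: (image_preds[label] + audio_preds[label]) / 2 for label in image_preds}
print(fused)  # {'joy': 0.6, 'sadness': 0.4}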

The video (video) and the corresponding audio (audio_sample) you created previously are still available:

[Image: frames from the Bounce TV commercial]

A list of emotions has been loaded as emotions.

This exercise is part of the course Multi-Modal Models with Hugging Face.

Exercise instructions

  • Make an audio classifier pipeline for zero-shot-audio-classification using the laion/clap-htsat-unfused model.
  • Make an image classifier pipeline for zero-shot-image-classification using the openai/clip-vit-large-patch14 model (the general zero-shot pipeline pattern is sketched after this list).
  • Use the image classifier pipeline to generate predictions for each image in the video.
  • Use the audio classifier pipeline to generate predictions for the audio_sample.
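
For reference, here is a minimal sketch of the zero-shot pipeline pattern (not the exercise solution): "example.png" is a hypothetical placeholder path and the candidate labels below are invented for illustration.

from transformers import pipeline

# Build a zero-shot image classifier and score an image against candidate labels;
# the pipeline returns a list of dicts with 'label' and 'score' keys, sorted by score
classifier = pipeline(task="zero-shot-image-classification", model="openai/clip-vit-large-patch14")
predictions = classifier("example.png", candidate_labels=["happiness", "sadness"])
# predictions looks like: [{'score': 0.92, 'label': 'happiness'}, {'score': 0.08, 'label': 'sadness'}]

The zero-shot-audio-classification pipeline follows the same pattern and returns predictions in the same label/score structure.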

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Import the pipeline function from Hugging Face transformers
from transformers import pipeline

# Make an audio classifier pipeline
audio_classifier = ____(model=____, task=____)

# Make an image classifier pipeline
image_classifier = ____(model=____, task=____)

scores = []
for img in video:
    # Make image scores
    predictions = ____(____, candidate_labels=____)
    scores.append({l['label']: l['score'] for l in predictions})

# Average the per-frame emotion scores across the whole video
av_scores = {emotion: sum(s[emotion] for s in scores) / len(scores) for emotion in emotions}

# Make audio scores
audio_scores = ____(____, candidate_labels=____)

# Convert the audio predictions to a label-to-score dictionary
audio_scores = {l['label']: l['score'] for l in audio_scores}

# Late fusion: average the image-based and audio-based scores for each emotion
multimodal_scores = {emotion: (av_scores[emotion] + audio_scores[emotion])/2 for emotion in emotions}
print(f"Multimodal scores: {multimodal_scores}")