
Audio-visual emotion analysis

1. Audio-visual emotion analysis

Let's learn how we can make multi-modal classifications with audio and visual data.

2. Audio-visual emotion analysis

Consider two advertising agencies, where one company's advert is currently outperforming the other's. We could use audio-visual emotion analysis to quantitatively compare the levels of emotion conveyed in both ads and see whether this lines up with their differing performances. So how do we do this?

3. CLAP

The CLAP model is the audio equivalent of the CLIP image-text model we previously encountered, with separate text and audio encoders. The training dataset consists of 633k matched audio waveforms and their text descriptions. The contrastive pretraining is designed to align the encoded audio with its encoded text description. In practice, this means the model gives a quantitative measure of how closely a piece of text matches an audio waveform.
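As a minimal sketch of this text-audio matching, assuming the "laion/clap-htsat-unfused" checkpoint and a 48 kHz NumPy audio array called audio_array (both illustrative choices, not from the video):

```python
from transformers import ClapModel, ClapProcessor

processor = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")
model = ClapModel.from_pretrained("laion/clap-htsat-unfused")

texts = ["a joyful jingle", "a sad piano melody"]
# audio_array is an assumed 48 kHz waveform loaded elsewhere
inputs = processor(text=texts, audios=[audio_array],
                   sampling_rate=48000, return_tensors="pt", padding=True)

# logits_per_audio holds the similarity of the audio clip to each text
outputs = model(**inputs)
print(outputs.logits_per_audio)
```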

4. Approach: multimodal ZSL

The CLAP model provides the ability to perform zero-shot classifications with audio.

5. Approach: multimodal ZSL

We can add this to the visual zero-shot classification we already learned with CLIP. By separating the audio and visual streams and using a shared set of text classes, we can combine CLIP's predictions on the frames with CLAP's predictions on the audio to perform audio-visual zero-shot learning.

6. Video and audio

MP4 is a common video file format, which contains separate streams for video and audio that are synchronized via timestamps. Using the MoviePy library, we can select a segment of the movie file with the ffmpeg_extract_subclip() function, specifying a start and end time, here, 0 to 5 seconds. MoviePy allows us to separate audio from video with ease. We load the MP4 file as a VideoFileClip object, and we separate the audio using the .audio attribute, which can then be written to file using the .write_audiofile() method to prepare it for use with CLAP.
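A sketch of these MoviePy steps, assuming MoviePy 1.x conventions and placeholder file names:

```python
from moviepy.video.io.ffmpeg_tools import ffmpeg_extract_subclip
from moviepy.editor import VideoFileClip

# Cut a 0-5 second subclip from the advert
ffmpeg_extract_subclip("advert.mp4", 0, 5, targetname="advert_clip.mp4")

# Separate the audio stream and write it to file for use with CLAP
clip = VideoFileClip("advert_clip.mp4")
clip.audio.write_audiofile("advert_clip.wav")
```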

7. Preparing the audio and video

To prepare the video file, we load the MP4 into a VideoReader object from the decord package and take the first 20 frames. To create PIL images from the frame arrays, we need to reverse the color channels, which is why the ::-1 slicing is required. To prepare the audio file, we create a Hugging Face dataset from a dictionary with the file path as the value for the "audio" key, then cast the column to the Audio feature class, which adds an audio array entry.
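A sketch of this preparation, reusing the placeholder files from the previous step:

```python
from decord import VideoReader
from PIL import Image
from datasets import Dataset, Audio

# Video: read the first 20 frames and convert to PIL images,
# reversing the color channels with ::-1 slicing
vr = VideoReader("advert_clip.mp4")
frames = vr.get_batch(range(20)).asnumpy()
images = [Image.fromarray(frame[:, :, ::-1]) for frame in frames]

# Audio: build a Hugging Face dataset and cast to the Audio feature,
# which adds a decoded audio array entry
audio_ds = Dataset.from_dict({"audio": ["advert_clip.wav"]})
audio_ds = audio_ds.cast_column("audio", Audio())
```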

8. Video predictions

To use a zero-shot pipeline, we first define a list of emotions. After loading our CLIP pipeline with the zero-shot-image-classification task, we can make predictions for each frame. The prediction returned by the model is a list of dictionaries, one per class, each with two entries: 'label' and 'score'. We then average the scores across all frames to get overall emotion scores for the video images.
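A sketch of the frame-level predictions, assuming the "openai/clip-vit-base-patch32" checkpoint and the images list prepared above:

```python
import numpy as np
from transformers import pipeline

emotions = ["joy", "sadness", "anger", "fear"]
image_pipe = pipeline(task="zero-shot-image-classification",
                      model="openai/clip-vit-base-patch32")

# Each prediction is a list of {'label': ..., 'score': ...} dictionaries
frame_scores = {label: [] for label in emotions}
for image in images:
    for pred in image_pipe(image, candidate_labels=emotions):
        frame_scores[pred["label"]].append(pred["score"])

# Average across frames to get overall emotion scores for the video
video_scores = {label: np.mean(scores) for label, scores in frame_scores.items()}
```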

9. Audio predictions and combination

The audio pipeline with the pretrained CLAP model uses the zero-shot-audio-classification task. The inference syntax is the same as for the images: calling the pipeline and collecting the labels and scores in a dictionary. To combine the two modalities into a single multimodal prediction, here we take the average of the two scores for each label. The model correctly assigns "joy" the highest confidence score!
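A sketch of the audio predictions and the multimodal average, assuming the "laion/clap-htsat-unfused" checkpoint and the audio_ds and video_scores objects from the earlier sketches:

```python
from transformers import pipeline

audio_pipe = pipeline(task="zero-shot-audio-classification",
                      model="laion/clap-htsat-unfused")

# Classify the audio clip against the same emotion labels
preds = audio_pipe(audio_ds[0]["audio"]["array"], candidate_labels=emotions)
audio_scores = {pred["label"]: pred["score"] for pred in preds}

# Combine the modalities by averaging the per-label scores
multimodal_scores = {label: (video_scores[label] + audio_scores[label]) / 2
                     for label in emotions}
print(max(multimodal_scores, key=multimodal_scores.get))
```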

10. Let's practice!

Let's practice multimodal zero-shot learning.