
Exercise

Automatic speech recognition

In this exercise, you'll use AI to transcribe audio into text automatically! You'll be working with the VCTK Corpus again, which includes around 44 hours of speech uttered by English speakers with various accents. You'll use OpenAI's Whisper tiny model, which contains only 37M parameters, to transcribe preprocessed VCTK audio data into the corresponding text.

The audio preprocessor (processor) has been loaded, as has the WhisperForConditionalGeneration class. A sample audio datapoint (sample) has also been loaded.

Instructions

100 XP
  • Load the WhisperForConditionalGeneration pretrained model using the openai/whisper-tiny checkpoint.
  • Preprocess the sample datapoint with the required sampling rate of 16000.
  • Generate the tokens from the model using the .input_features attribute of the preprocessed inputs.
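The steps above can be sketched as follows. This is a minimal, self-contained version: since the course's preloaded `processor` and `sample` objects aren't available here, the sketch loads the processor explicitly and substitutes a synthetic silent audio array for the VCTK utterance.

```python
import numpy as np
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Stand-in for the preloaded `sample`: one second of silence at 16 kHz
# (in the exercise, this would be a real VCTK utterance array).
audio_array = np.zeros(16000, dtype=np.float32)

# Load the preprocessor and the pretrained model from the
# openai/whisper-tiny checkpoint
processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

# Preprocess the audio with the required 16000 Hz sampling rate
inputs = processor(audio_array, sampling_rate=16000, return_tensors="pt")

# Generate token IDs from the log-mel features, then decode them to text
predicted_ids = model.generate(inputs.input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
```

With a real speech sample in place of the silent array, `transcription` holds the model's text transcript of the utterance.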