Automatic speech recognition
In this exercise, you'll use AI to transcribe audio into text automatically! You'll be working with the VCTK Corpus again, which includes around 44 hours of speech uttered by English speakers with various accents. You'll use OpenAI's Whisper tiny model, which contains only 39M parameters, to preprocess the VCTK audio data and generate the corresponding text.
The audio preprocessor (processor) has been loaded, as has the WhisperForConditionalGeneration module. A sample audio datapoint (sample) has already been loaded.
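For context, here is a minimal sketch of how such a setup could be prepared. The Hub dataset identifier and the audio field name below are assumptions for illustration, not part of the preloaded exercise environment:

from datasets import load_dataset
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Load the Whisper preprocessor (feature extractor + tokenizer) for the tiny checkpoint
processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")

# Stream one datapoint from a VCTK dataset on the Hub ("vctk" is an assumed identifier)
dataset = load_dataset("vctk", split="train", streaming=True)
sample = next(iter(dataset))["audio"]  # dict with "array" and "sampling_rate" keys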
This exercise is part of the course Multi-Modal Models with Hugging Face.
Exercise instructions
- Load the WhisperForConditionalGeneration pretrained model using the openai/whisper-tiny checkpoint.
- Preprocess the sample datapoint with the required sampling rate of 16000.
- Generate the tokens from the model using the .input_features attribute of the preprocessed inputs.
Interactive exercise
Complete the sample code to finish this exercise successfully.
# Load the pretrained model
model = ____
model.config.forced_decoder_ids = None
# Preprocess the sample audio
input_preprocessed = ____(____, sampling_rate=____, return_tensors="pt", return_attention_mask=True)
# Generate the IDs of the recognized tokens
predicted_ids = ____
transcription = processor.decode(predicted_ids[0], skip_special_tokens=True)
print(transcription)
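One possible completion of the template, assuming sample carries its raw waveform under an "array" key, as Hugging Face audio datasets do:

# Load the pretrained model from the openai/whisper-tiny checkpoint
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")
model.config.forced_decoder_ids = None

# Preprocess the sample audio at Whisper's required 16 kHz sampling rate
# (sample["array"] assumes the Hugging Face audio-dict layout)
input_preprocessed = processor(sample["array"], sampling_rate=16000, return_tensors="pt", return_attention_mask=True)

# Generate the IDs of the recognized tokens from the log-Mel input features
predicted_ids = model.generate(input_preprocessed.input_features)

transcription = processor.decode(predicted_ids[0], skip_special_tokens=True)
print(transcription)

Setting forced_decoder_ids to None lets the model predict the language and task tokens itself instead of being forced toward a fixed prompt.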