Speech recognition and audio generation

1. Speech recognition and audio generation

Let's now explore audio models!

2. Speech

Human speech is composed of numerous intricate elements: pitch, akin to the note played on a musical instrument and different on average for males and females; stress, the emphasis placed on certain syllables, which varies with language and accent; and rhythm, the timing and pacing of speech, which depends on language and emotion.

3. Automatic speech recognition

Both text and audio are sequential data types, as the ordering of individual words and sounds is crucial. The idea behind tasks such as audio-to-text, text-to-audio, or audio-to-audio is to encode the input data and then decode it into the required sequence. This opens up many possibilities, including automatic transcription, audio cleaning, or generating spoken audio from text.

4. Automatic speech recognition

The Whisper family of models from OpenAI is a popular choice for automatic speech recognition. The models come in a range of sizes, including a tiny version trained on 680k hours of labeled audio. To use it, we need both the WhisperProcessor and WhisperForConditionalGeneration from transformers. Let's try transcribing audio with this model.
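As a rough sketch, loading the processor and model might look like the following; the "openai/whisper-tiny" checkpoint name is an assumption here, and the course exercises may use a different size.

```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Assumed checkpoint; any Whisper size follows the same pattern
checkpoint = "openai/whisper-tiny"
processor = WhisperProcessor.from_pretrained(checkpoint)
model = WhisperForConditionalGeneration.from_pretrained(checkpoint)
```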

5. Automatic speech recognition

We'll work again with the VCTK dataset, first resampling the audio to a 16,000 Hz sampling rate to match what the model expects; we can find this information on the model card. We'll select a data point from the dataset and define a processor to extract features from the sample. The .input_features attribute holds the features extracted from the preprocessed audio, which we pass to the model's .generate() method. The processor's .batch_decode() method then converts the model's output IDs into natural language. The skip_special_tokens flag ignores special tokens added by the processor and returns only the transcribed text.
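Put together, the transcription step could look like this minimal sketch. It assumes the VCTK data has already been loaded into a variable named dataset with an "audio" column, and it reuses the processor and model from the previous sketch; the indexing into the dataset is a placeholder.

```python
from datasets import Audio

# Resample the audio column to the 16 kHz rate the model expects
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))

# Select one data point and extract features from its waveform
sample = dataset[0]["audio"]
inputs = processor(sample["array"],
                   sampling_rate=sample["sampling_rate"],
                   return_tensors="pt")

# Generate output token IDs from the input features, then decode them to text
predicted_ids = model.generate(inputs.input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
print(transcription[0])
```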

6. Audio generation

Now let's move to generating audio, where we'll use a model to remove noise from an audio sample. Such models have three components: a preprocessor for standardization and feature extraction, a feature transformation model, and a separate generative model, called a vocoder, that produces the audio waveform as output. In the case of the SpeechT5 model from Microsoft, these three components are SpeechT5Processor, SpeechT5ForSpeechToSpeech, and SpeechT5HifiGan, respectively. Note that while the preprocessor and model require the same checkpoint, the generative vocoder model is generic. To generate speech, we also need to know the speech characteristics of the speaker.
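A sketch of loading the three components might look like the following; the checkpoint names are assumptions, since the transcript doesn't spell them out.

```python
from transformers import (SpeechT5Processor, SpeechT5ForSpeechToSpeech,
                          SpeechT5HifiGan)

# The processor and model share a checkpoint; the vocoder has its own,
# generic checkpoint (both names below are assumed)
processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_vc")
model = SpeechT5ForSpeechToSpeech.from_pretrained("microsoft/speecht5_vc")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")
```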

7. Speech embeddings

These characteristics such as stress, rhythm, and pitch are encoded in speech embeddings. These are provided as additional inputs to the model. Let's see how they're generated.

8. Generating speaker embeddings

To create a speaker embedding, we need an encoder. We'll use EncoderClassifier from SpeechBrain, loading a model with the .from_hparams() method. The .encode_batch() method applies the encoder to an audio array to create the speech embedding. Note that we need to cast the waveform to a PyTorch tensor to use the encoder, and the resulting speech embedding should be normalized to be compatible with our model. We achieve this with the normalize() function from torch.nn.functional. For single data points, we also adjust the tensor's dimensions, squeezing out unused ones and using the .unsqueeze() method to add back the batch dimension the model expects.
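Here is a minimal sketch of this step, assuming an x-vector speaker-verification checkpoint from SpeechBrain (the exact source name is an assumption) and the resampled sample from earlier:

```python
import torch
import torch.nn.functional as F
from speechbrain.pretrained import EncoderClassifier

# Assumed x-vector checkpoint for speaker embeddings
encoder = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-xvect-voxceleb")

# Cast the waveform to a PyTorch tensor and encode it
waveform = torch.tensor(sample["array"])
embeddings = encoder.encode_batch(waveform)      # shape: (1, 1, 512)

# Normalize the embedding, then reshape it for the SpeechT5 model:
# squeeze out the extra dimensions and add back a batch dimension
embeddings = F.normalize(embeddings, dim=2)
speaker_embeddings = embeddings.squeeze().unsqueeze(0)   # shape: (1, 512)
```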

9. Audio generation

With our speech embeddings created, we can set up the processor to extract features from the audio input, resampled to the required sampling rate. To generate the new, denoised audio, we pass the input_values of the preprocessed data to .generate_speech(), along with the speech embedding and the vocoder. The effect of denoising shows up as brighter, more clearly defined bands in the spectrogram.
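Assuming the processor, model, vocoder, and speaker_embeddings variables from the earlier sketches, the generation step could look like this:

```python
# Extract features from the resampled audio input
inputs = processor(audio=sample["array"],
                   sampling_rate=sample["sampling_rate"],
                   return_tensors="pt")

# Generate the denoised waveform, conditioned on the speaker embedding,
# and let the vocoder turn the spectrogram into audio
speech = model.generate_speech(inputs["input_values"],
                               speaker_embeddings,
                               vocoder=vocoder)
```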

10. Let's practice!

Time to give this a go!