Fine-tuning text-to-speech models
1. Fine-tuning text-to-speech models
Welcome back! In this video, we will fine-tune pre-trained text-to-speech models for specific domains.
2. Purpose of fine-tuning text-to-speech
Fine-tuning text-to-speech models can help them recognize and produce unique sounds in new languages, such as click consonants in Xhosa or rolled, trilled "R" sounds in Italian and Spanish. Additionally, fine-tuning can adjust the model's style and delivery to better suit a particular application or context. Because English datasets dominate model pretraining, we often need to refine models to produce realistic speech outside of the training domain; for example, to produce Italian speech or a new dialect of a language.
3. Purpose of fine-tuning text-to-speech
The features of the speech embedding are used together with the encoded text features of the text-to-speech model as inputs to the generative model that produces the waveform. This means that if certain characteristics are not sufficiently represented in the training data (for example, an Italian accent), introducing an Italian speech embedding alone would likely not result in realistic Italian speech. Fine-tuning on new training data is required.
4. Preparing an audio dataset
We'll fine-tune Microsoft's SpeechT5 model using the VoxPopuli dataset from Meta on Hugging Face, which contains transcribed speech from European languages recorded at the European Parliament. The dataset provides the option to download a subset of the data with a language identifier; we will use "it" for Italian. The dataset includes the audio and normalized_text columns, but we need to preprocess it further to add the speech embeddings. For this, we will use the same pretrained encoder as before.
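As a minimal sketch, loading the Italian subset might look like this (the resampling step is an assumption, added to match SpeechT5's expected 16 kHz input):

```python
from datasets import load_dataset, Audio

# Download the Italian ("it") subset of VoxPopuli from the Hugging Face Hub
dataset = load_dataset("facebook/voxpopuli", "it", split="train")

# SpeechT5 expects 16 kHz audio; resample in case the source rate differs
dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
```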
5. Audio preprocessing
To preprocess the data, we need the SpeechT5Processor from the same model checkpoint, and we'll define a function that performs the preprocessing steps in series. We apply the processor to the normalized text and the audio array, passing the sampling rate from the dataset. After preprocessing the audio, we remove the batch dimension from the labels entry by zero-indexing, and add the speech embeddings using .encode_batch() and normalize() as before. The .map() method then applies the prepare_dataset() function to the whole dataset.
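A sketch of these steps, assuming the standard microsoft/speecht5_tts checkpoint and the SpeechBrain x-vector encoder; create_speaker_embedding() is a helper name introduced here for illustration:

```python
import torch
from transformers import SpeechT5Processor
from speechbrain.pretrained import EncoderClassifier

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
speaker_model = EncoderClassifier.from_hparams(source="speechbrain/spkrec-xvect-voxceleb")

def create_speaker_embedding(waveform):
    # Encode the waveform into an x-vector and normalize it
    with torch.no_grad():
        embeddings = speaker_model.encode_batch(torch.tensor(waveform))
        embeddings = torch.nn.functional.normalize(embeddings, dim=2)
    return embeddings.squeeze().cpu().numpy()

def prepare_dataset(example):
    audio = example["audio"]
    # Tokenize the text and convert the target audio to a log-mel spectrogram
    example = processor(
        text=example["normalized_text"],
        audio_target=audio["array"],
        sampling_rate=audio["sampling_rate"],
    )
    # The processor returns batched labels; zero-index to drop the batch dimension
    example["labels"] = example["labels"][0]
    # Attach the speaker embedding computed from the source audio
    example["speaker_embeddings"] = create_speaker_embedding(audio["array"])
    return example

dataset = dataset.map(prepare_dataset, remove_columns=dataset.column_names)
```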
6. Training arguments
To fine-tune a sequence-to-sequence model for text-to-audio generation, we often use the specialized Seq2SeqTrainingArguments class. Important parameters include gradient_accumulation_steps, which determines the number of batches of training data to process before updating the model; learning_rate, which controls the step size taken during optimization; and warmup_steps, which gradually increases the learning rate over that number of steps. The names of the model outputs can be specified with the label_names keyword.
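For example (the numeric values below are illustrative, not the course's exact settings, and the output directory name is hypothetical):

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="speecht5_finetuned_voxpopuli_it",  # hypothetical output path
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,  # effective batch size = 4 * 8 = 32
    learning_rate=1e-5,             # step size taken during optimization
    warmup_steps=500,               # ramp the learning rate up over 500 steps
    max_steps=4000,
    label_names=["labels"],         # names of the model's target outputs
)
```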
7. Putting it all together
With the training arguments configured, we can use them to fine-tune our model. We start by choosing our pretrained model, its associated processor, and a vocoder. We pass the model, training arguments, a training and validation dataset, and the processor to the trainer. Since we are using a pretrained model, we should use the same SpeechT5Processor that was used to train the original model. Then the .train() method can be used to start training.
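A sketch of that setup, assuming the standard SpeechT5 checkpoints; a real run would also need a padding data collator for variable-length batches, which is omitted here for brevity:

```python
from transformers import (
    SpeechT5ForTextToSpeech,
    SpeechT5HifiGan,
    Seq2SeqTrainer,
)

# Pretrained model, its processor (loaded earlier), and a matching vocoder
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

# Assumption: split the preprocessed dataset into training and validation sets
dataset = dataset.train_test_split(test_size=0.1)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=processor,  # the same SpeechT5Processor used for pretraining
)
trainer.train()
```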
8. Using the new model
Inference with our fine-tuned model works the same way as with the original model: extract the speaker embeddings, process the text, and generate the new speech using the vocoder. Let's view the resulting spectrogram!
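A sketch of inference, assuming the objects defined above; the input sentence and the choice of speaker embedding are illustrative:

```python
import matplotlib.pyplot as plt
import torch

text = "Buongiorno a tutti!"  # hypothetical Italian input
inputs = processor(text=text, return_tensors="pt")

# Reuse a speaker embedding extracted during preprocessing, shaped (1, 512)
speaker_embeddings = torch.tensor(dataset["train"][0]["speaker_embeddings"]).unsqueeze(0)

# Without a vocoder, generate_speech() returns the mel spectrogram
spectrogram = model.generate_speech(inputs["input_ids"], speaker_embeddings)
plt.imshow(spectrogram.T)
plt.show()

# With the vocoder, it returns the audible waveform instead
speech = model.generate_speech(inputs["input_ids"], speaker_embeddings, vocoder=vocoder)
```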
9. Using the new model
There we have it!
10. Let's practice!
Let's try this out!