Fine-tuning a text-to-speech model
In this exercise, you will fine-tune a text-to-speech model to replicate regional accents using the VCTK Corpus, which contains around 44 hours of speech from English speakers with a variety of accents.
The dataset has already been loaded and preprocessed, and the SpeechT5ForTextToSpeech, Seq2SeqTrainingArguments, and Seq2SeqTrainer classes have been imported. A data collator (data_collator) has also been predefined.
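Because these objects are preloaded, you do not need to import or set up anything yourself. For reference only, a minimal sketch of the equivalent imports and processor setup (assuming the standard Hugging Face transformers API) might look like this:

# Imports and processor setup (already done for you in this environment)
from transformers import SpeechT5ForTextToSpeech, SpeechT5Processor, Seq2SeqTrainingArguments, Seq2SeqTrainer

# The SpeechT5 processor bundles the tokenizer and feature extractor used when preparing the data
processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")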
Please do not call the .train() method on the trainer, as training will time out in this environment.
This exercise is part of the course Multi-Modal Models with Hugging Face.
Exercise instructions
- Load the microsoft/speecht5_tts pretrained model using SpeechT5ForTextToSpeech.
- Create an instance of Seq2SeqTrainingArguments with gradient_accumulation_steps set to 8, learning_rate set to 0.00001, warmup_steps set to 500, and max_steps set to 4000.
- Configure the trainer with the new training arguments, and the model, data, and processor provided.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Load the text-to-speech pretrained model
model = ____.____(____)
# Configure the required training arguments
training_args = ____(output_dir="speecht5_finetuned_vctk_test",
                     gradient_accumulation_steps=____, learning_rate=____, warmup_steps=____,
                     max_steps=4000, label_names=["labels"], push_to_hub=False)
# Configure the trainer
trainer = ____(args=training_args, model=model, data_collator=data_collator,
               train_dataset=dataset["train"], eval_dataset=dataset["test"], tokenizer=processor)
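If you want to check your answer, one possible way to fill in the blanks, following the instructions above (the dataset, processor, and data_collator names come from the preloaded exercise environment), is:

# Load the text-to-speech pretrained model
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")

# Configure the required training arguments
training_args = Seq2SeqTrainingArguments(
    output_dir="speecht5_finetuned_vctk_test",
    gradient_accumulation_steps=8,
    learning_rate=0.00001,
    warmup_steps=500,
    max_steps=4000,
    label_names=["labels"],
    push_to_hub=False,
)

# Configure the trainer (remember: do not call trainer.train() in this environment)
trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    data_collator=data_collator,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=processor,
)

Setting gradient_accumulation_steps to 8 effectively increases the batch size eightfold without using more memory per step, while the small learning rate and 500 warmup steps help keep fine-tuning stable over the 4000 training steps.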