Fine-tuning a text-to-speech model
You will be working with the VCTK Corpus, which includes around 44 hours of speech data uttered by English speakers with various accents, to fine-tune a text-to-speech model to replicate regional accents.
The dataset has already been loaded and preprocessed, and the SpeechT5ForTextToSpeech class has been imported, as have Seq2SeqTrainingArguments and Seq2SeqTrainer. A data collator (data_collator) has been predefined.
Please do not call the .train() method on the trainer, as this code will time out in this environment.
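For reference, the preloaded objects correspond roughly to the setup sketched below. This is a minimal sketch and not part of the exercise: the choice of SpeechT5Processor is an assumption based on the microsoft/speecht5_tts checkpoint, and the dataset and data_collator preparation is omitted because both are already provided.

# Reference sketch of the preloaded environment (these objects already exist
# in the exercise; the processor choice is an assumption based on the
# microsoft/speecht5_tts checkpoint).
from transformers import (
    SpeechT5ForTextToSpeech,
    SpeechT5Processor,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
)

# Processor bundling the SpeechT5 tokenizer and feature extractor
processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")

# `dataset` (with "train" and "test" splits) and `data_collator` are assumed
# to be prepared elsewhere, as described above.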
This exercise is part of the course Multi-Modal Models with Hugging Face.
Exercise instructions
- Load the microsoft/speecht5_tts pretrained model using SpeechT5ForTextToSpeech.
- Create an instance of Seq2SeqTrainingArguments with: gradient_accumulation_steps set to 8, learning_rate set to 0.00001, warmup_steps set to 500, and max_steps set to 4000.
- Configure the trainer with the new training arguments and the model, data, and processor provided.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Load the text-to-speech pretrained model
model = ____.____(____)

# Configure the required training arguments
training_args = ____(output_dir="speecht5_finetuned_vctk_test",
                     gradient_accumulation_steps=____, learning_rate=____, warmup_steps=____,
                     max_steps=4000, label_names=["labels"], push_to_hub=False)

# Configure the trainer
trainer = ____(args=training_args, model=model, data_collator=data_collator,
               train_dataset=dataset["train"], eval_dataset=dataset["test"], tokenizer=processor)
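One possible way to complete the blanks, following the instructions above, is sketched here. It assumes the preloaded dataset, data_collator, and processor described earlier; treat it as a sketch rather than the graded solution.

# Possible completion of the exercise (assumes the preloaded `dataset`,
# `data_collator`, and `processor` described above)

# Load the text-to-speech pretrained model
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")

# Configure the required training arguments
training_args = Seq2SeqTrainingArguments(
    output_dir="speecht5_finetuned_vctk_test",
    gradient_accumulation_steps=8,
    learning_rate=0.00001,
    warmup_steps=500,
    max_steps=4000,
    label_names=["labels"],
    push_to_hub=False,
)

# Configure the trainer (do not call trainer.train() in this environment)
trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    data_collator=data_collator,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=processor,
)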