

Exercise

Fine-tuning a text-to-speech model

You will be working with the VCTK Corpus, which includes around 44 hours of speech from English speakers with various accents, to fine-tune a text-to-speech model so that it replicates regional accents.

The dataset has already been loaded and preprocessed. The SpeechT5ForTextToSpeech, Seq2SeqTrainingArguments, and Seq2SeqTrainer classes have been imported, and a data collator (data_collator) has been predefined.

Please do not call the .train() method on the trainer, because training will time out in this environment.

Instructions

  • Load the microsoft/speecht5_tts pretrained model using SpeechT5ForTextToSpeech.
  • Create an instance of Seq2SeqTrainingArguments with: gradient_accumulation_steps set to 8, learning_rate set to 0.00001, warmup_steps set to 500, and max_steps set to 4000.
  • Configure the trainer with the new training arguments and the provided model, data, and processor.
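One way the three steps above might fit together is sketched below. It is not the exercise's reference solution. The output directory name, the "train" and "test" split names, and the helper build_trainer are assumptions made for illustration. The keyword used to pass the processor depends on the installed transformers version: older releases accept tokenizer=, newer ones use processing_class=.

```python
# Hyperparameters taken from the instructions above.
TRAINING_KWARGS = {
    "output_dir": "speecht5_finetuned_vctk",  # hypothetical directory name
    "gradient_accumulation_steps": 8,
    "learning_rate": 0.00001,
    "warmup_steps": 500,
    "max_steps": 4000,
}


def build_trainer(dataset, data_collator, processor):
    """Assemble (but do not run) a Seq2SeqTrainer for SpeechT5."""
    # Imported here so the settings above can be inspected without transformers.
    from transformers import (
        Seq2SeqTrainer,
        Seq2SeqTrainingArguments,
        SpeechT5ForTextToSpeech,
    )

    # Step 1: load the pretrained text-to-speech checkpoint.
    model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")

    # Step 2: create the training arguments from the settings above.
    training_args = Seq2SeqTrainingArguments(**TRAINING_KWARGS)

    # Step 3: configure the trainer; .train() is deliberately not called.
    return Seq2SeqTrainer(
        args=training_args,
        model=model,
        train_dataset=dataset["train"],  # assumed split name
        eval_dataset=dataset["test"],    # assumed split name
        data_collator=data_collator,
        tokenizer=processor,  # newer releases: processing_class=processor
    )
```

In the exercise environment this would be used as trainer = build_trainer(dataset, data_collator, processor), leaving trainer.train() uncalled.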