Fine-tuning a text-to-speech model
You will be working with the VCTK Corpus, which includes around 44 hours of speech uttered by English speakers with various accents, to fine-tune a text-to-speech model to replicate regional accents.
The dataset has already been loaded and preprocessed, and the SpeechT5ForTextToSpeech class has been imported, as have Seq2SeqTrainingArguments and Seq2SeqTrainer. A data collator (data_collator) has been predefined.
Please do not call the .train() method on the trainer, as this code will time out in this environment.
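In this environment the relevant classes are already imported. If you are recreating the setup locally, the imports would look something like this (a sketch, assuming the Hugging Face `transformers` library is installed):

```python
from transformers import (
    SpeechT5ForTextToSpeech,   # the text-to-speech model class
    SpeechT5Processor,         # tokenizer + feature extractor wrapper
    Seq2SeqTrainingArguments,  # container for training hyperparameters
    Seq2SeqTrainer,            # trainer for sequence-to-sequence models
)
```
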
This exercise is part of the course
Multi-Modal Models with Hugging Face
Exercise instructions
- Load the `microsoft/speecht5_tts` pretrained model using `SpeechT5ForTextToSpeech`.
- Create an instance of `Seq2SeqTrainingArguments` with: `gradient_accumulation_steps` set to `8`, `learning_rate` set to `0.00001`, `warmup_steps` set to `500`, and `max_steps` set to `4000`.
- Configure the trainer with the new training arguments, and the model, data, and `processor` provided.
Hands-on interactive exercise
Try this exercise by completing the sample code below.
# Load the text-to-speech pretrained model
model = ____.____(____)
# Configure the required training arguments
training_args = ____(output_dir="speecht5_finetuned_vctk_test",
                     gradient_accumulation_steps=____, learning_rate=____,
                     warmup_steps=____, max_steps=4000, label_names=["labels"],
                     push_to_hub=False)
# Configure the trainer
trainer = ____(args=training_args, model=model, data_collator=data_collator,
               train_dataset=dataset["train"], eval_dataset=dataset["test"],
               tokenizer=processor)
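One possible completion of the scaffold, following the instructions above. The `dataset`, `processor`, and `data_collator` names are assumed to be predefined by the exercise environment, so the trainer step is guarded to let the rest of the sketch run standalone:

```python
from transformers import (
    SpeechT5ForTextToSpeech,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
)

# Load the text-to-speech pretrained model
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")

# Configure the required training arguments
training_args = Seq2SeqTrainingArguments(
    output_dir="speecht5_finetuned_vctk_test",
    gradient_accumulation_steps=8,
    learning_rate=0.00001,
    warmup_steps=500,
    max_steps=4000,
    label_names=["labels"],
    push_to_hub=False,
)

# Configure the trainer; dataset, processor, and data_collator are
# assumed to come from the exercise environment
if all(name in globals() for name in ("dataset", "processor", "data_collator")):
    trainer = Seq2SeqTrainer(
        args=training_args,
        model=model,
        data_collator=data_collator,
        train_dataset=dataset["train"],
        eval_dataset=dataset["test"],
        tokenizer=processor,
    )
```

Remember that calling `trainer.train()` would time out in this environment, so the trainer is only configured, not run.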