Fine-tuning a text-to-speech model
You will be working with the VCTK Corpus, which includes around 44 hours of speech data uttered by English speakers with various accents, to fine-tune a text-to-speech model to replicate regional accents.
The dataset has already been loaded and preprocessed, and the SpeechT5ForTextToSpeech, Seq2SeqTrainingArguments, and Seq2SeqTrainer classes have been imported. A data collator (data_collator) has been predefined.
Please do not call the .train() method on the trainer, as this code will time out in this environment.
This exercise is part of the course
Multi-Modal Models with Hugging Face
Exercise instructions
- Load the microsoft/speecht5_tts pretrained model using SpeechT5ForTextToSpeech.
- Create an instance of Seq2SeqTrainingArguments with: gradient_accumulation_steps set to 8, learning_rate set to 0.00001, warmup_steps set to 500, and max_steps set to 4000.
- Configure the trainer with the new training arguments, and the model, data, and processor provided.
Hands-on interactive exercise
Try this exercise by completing this sample code.
# Load the text-to-speech pretrained model
model = ____.____(____)
# Configure the required training arguments
training_args = ____(output_dir="speecht5_finetuned_vctk_test",
gradient_accumulation_steps=____, learning_rate=____, warmup_steps=____, max_steps=4000, label_names=["labels"],
push_to_hub=False)
# Configure the trainer
trainer = ____(args=training_args, model=model, data_collator=data_collator,
train_dataset=dataset["train"], eval_dataset=dataset["test"], tokenizer=processor)
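For reference, the blanks above can be filled in as follows. This is a sketch under the exercise's stated assumptions: dataset, processor, and data_collator are predefined in the exercise environment, and loading the microsoft/speecht5_tts checkpoint will download it from the Hugging Face Hub.

```python
from transformers import (SpeechT5ForTextToSpeech,
                          Seq2SeqTrainingArguments, Seq2SeqTrainer)

# Load the text-to-speech pretrained model from the Hub
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")

# Configure the required training arguments
training_args = Seq2SeqTrainingArguments(
    output_dir="speecht5_finetuned_vctk_test",
    gradient_accumulation_steps=8,
    learning_rate=0.00001,
    warmup_steps=500,
    max_steps=4000,
    label_names=["labels"],
    push_to_hub=False,
)

# Configure the trainer; dataset, data_collator, and processor
# are assumed to be predefined by the exercise environment
trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    data_collator=data_collator,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=processor,
)
# Do not call trainer.train() here -- it would time out in this environment
```

Note that gradient_accumulation_steps=8 makes each optimizer step accumulate gradients over 8 batches, giving a larger effective batch size without extra memory, and warmup_steps=500 ramps the learning rate up gradually before the schedule's decay begins.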