Fine-tuning a text-to-speech model
You will be working with the VCTK Corpus, which includes around 44 hours of speech data uttered by English speakers with various accents, to fine-tune a text-to-speech model to replicate regional accents.
The dataset has already been loaded and preprocessed, and the SpeechT5ForTextToSpeech module has been loaded, as have the Seq2SeqTrainingArguments and Seq2SeqTrainer modules. A data collator (data_collator) has been predefined.
Please do not call the .train() method on the trainer, as this code will time out in this environment.
This exercise is part of the course
Multi-Modal Models with Hugging Face
Exercise instructions
- Load the microsoft/speecht5_tts pretrained model using SpeechT5ForTextToSpeech.
- Create an instance of Seq2SeqTrainingArguments with: gradient_accumulation_steps set to 8, learning_rate set to 0.00001, warmup_steps set to 500, and max_steps set to 4000.
- Configure the trainer with the new training arguments, and the model, data, and processor provided.
Interactive exercise
Complete the sample code to finish this exercise.
# Load the text-to-speech pretrained model
model = ____.____(____)
# Configure the required training arguments
training_args = ____(
    output_dir="speecht5_finetuned_vctk_test",
    gradient_accumulation_steps=____,
    learning_rate=____,
    warmup_steps=____,
    max_steps=4000,
    label_names=["labels"],
    push_to_hub=False,
)

# Configure the trainer
trainer = ____(
    args=training_args,
    model=model,
    data_collator=data_collator,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=processor,
)
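For reference, a completed version of the scaffold, following the instructions above, might look like the sketch below. It assumes the exercise environment has already defined dataset, processor, and data_collator (as stated earlier), so it is not runnable in isolation, and .train() is deliberately not called.

```python
from transformers import (
    SpeechT5ForTextToSpeech,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
)

# Load the text-to-speech pretrained model
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")

# Configure the required training arguments per the instructions
training_args = Seq2SeqTrainingArguments(
    output_dir="speecht5_finetuned_vctk_test",
    gradient_accumulation_steps=8,  # accumulate gradients over 8 steps
    learning_rate=0.00001,
    warmup_steps=500,
    max_steps=4000,
    label_names=["labels"],
    push_to_hub=False,
)

# Configure the trainer; `dataset`, `data_collator`, and `processor`
# are assumed to be predefined in this environment
trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    data_collator=data_collator,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=processor,
)
```

Note that gradient_accumulation_steps=8 multiplies the effective batch size by eight without increasing per-device memory use, which is why it pairs well with the small learning rate and warmup schedule requested here.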