
Fine-tuning a text-to-speech model

You will be working with the VCTK Corpus, which contains around 44 hours of speech uttered by English speakers with various accents, to fine-tune a text-to-speech model to replicate regional accents.

The dataset has already been loaded and preprocessed, and the SpeechT5ForTextToSpeech class has been imported, along with Seq2SeqTrainingArguments and Seq2SeqTrainer. A data collator (data_collator) has been predefined.

Please do not call the .train() method on the trainer, as this code will time out in this environment.

This exercise is part of the course

Multi-Modal Models with Hugging Face


Exercise instructions

  • Load the microsoft/speecht5_tts pretrained model using SpeechT5ForTextToSpeech.
  • Create an instance of Seq2SeqTrainingArguments with: gradient_accumulation_steps set to 8, learning_rate set to 0.00001, warmup_steps set to 500, and max_steps set to 4000.
  • Configure the trainer with the new training arguments, and the model, data, and processor provided.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Load the text-to-speech pretrained model
model = ____.____(____)

# Configure the required training arguments
training_args = ____(
    output_dir="speecht5_finetuned_vctk_test",
    gradient_accumulation_steps=____,
    learning_rate=____,
    warmup_steps=____,
    max_steps=4000,
    label_names=["labels"],
    push_to_hub=False,
)

# Configure the trainer
trainer = ____(
    args=training_args,
    model=model,
    data_collator=data_collator,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=processor,
)
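For reference, the completed scaffold might look like the sketch below, following the parameter values given in the exercise instructions. It assumes the exercise environment's predefined variables (dataset, data_collator, and processor) are available; the imports are included here so the snippet also reads as a standalone outline.

```python
from transformers import (
    SpeechT5ForTextToSpeech,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
)

# Load the text-to-speech pretrained model from the Hugging Face Hub
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")

# Configure the required training arguments:
# accumulate gradients over 8 steps, use a small learning rate,
# warm up for 500 steps, and cap training at 4000 steps
training_args = Seq2SeqTrainingArguments(
    output_dir="speecht5_finetuned_vctk_test",
    gradient_accumulation_steps=8,
    learning_rate=0.00001,
    warmup_steps=500,
    max_steps=4000,
    label_names=["labels"],
    push_to_hub=False,
)

# Configure the trainer with the model, data, collator, and processor
# (dataset, data_collator, and processor are predefined in this exercise)
trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    data_collator=data_collator,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=processor,
)
```

Remember: do not call trainer.train() in this environment, as it will time out.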