
Fine-tuning a text-to-speech model

You will be working with the VCTK Corpus, which includes around 44 hours of speech uttered by English speakers with various accents, to fine-tune a text-to-speech model to replicate regional accents.

The dataset has already been loaded and preprocessed, and the SpeechT5ForTextToSpeech, Seq2SeqTrainingArguments, and Seq2SeqTrainer classes have been imported. A data collator (data_collator) has been predefined.
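The predefined data_collator is not shown in the exercise, but a minimal sketch of what a TTS data collator typically does may help: it pads the tokenized text inputs and the mel-spectrogram labels to the longest example in the batch, masking padded label frames with -100 so they are ignored by the loss. This class name and logic are illustrative assumptions, not the exercise's actual implementation.

```python
import torch

class SimpleTTSDataCollator:
    """Illustrative collator: pads text inputs and spectrogram labels."""

    def __call__(self, features):
        max_text = max(len(f["input_ids"]) for f in features)
        max_frames = max(f["labels"].shape[0] for f in features)
        n_mels = features[0]["labels"].shape[1]

        input_ids = torch.zeros(len(features), max_text, dtype=torch.long)
        attention_mask = torch.zeros(len(features), max_text, dtype=torch.long)
        # -100 marks padded spectrogram frames to be ignored by the loss
        labels = torch.full((len(features), max_frames, n_mels), -100.0)

        for i, f in enumerate(features):
            ids = torch.as_tensor(f["input_ids"], dtype=torch.long)
            input_ids[i, : len(ids)] = ids
            attention_mask[i, : len(ids)] = 1
            labels[i, : f["labels"].shape[0]] = f["labels"]

        return {"input_ids": input_ids,
                "attention_mask": attention_mask,
                "labels": labels}
```

In the exercise itself you only need to pass the predefined data_collator to the trainer; the sketch above is just to make its role concrete.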

Please do not call the .train() method on the trainer, as training would time out in this environment.

This exercise is part of the course

Multi-Modal Models with Hugging Face


Exercise instructions

  • Load the microsoft/speecht5_tts pretrained model using SpeechT5ForTextToSpeech.
  • Create an instance of Seq2SeqTrainingArguments with: gradient_accumulation_steps set to 8, learning_rate set to 0.00001, warmup_steps set to 500, and max_steps set to 4000.
  • Configure the trainer with the new training arguments and with the model, data collator, datasets, and processor provided.
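A note on the accumulation setting above: with gradient_accumulation_steps set to 8, gradients are summed over 8 forward/backward passes before each optimizer step, so the effective batch size is the per-device batch size times 8. The per-device value below is a hypothetical choice for illustration; the exercise does not set it explicitly.

```python
# Hypothetical per-device batch size, for illustration only
per_device_train_batch_size = 4
gradient_accumulation_steps = 8  # value used in this exercise

# Optimizer steps happen once per 8 batches, so:
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps
print(effective_batch_size)  # 32
```

This is a common way to simulate larger batches on memory-constrained GPUs.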

Interactive exercise

Complete the sample code to successfully finish this exercise.

# Load the text-to-speech pretrained model
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")

# Configure the required training arguments
training_args = Seq2SeqTrainingArguments(output_dir="speecht5_finetuned_vctk_test",
    gradient_accumulation_steps=8, learning_rate=0.00001, warmup_steps=500, max_steps=4000, label_names=["labels"],
    push_to_hub=False)

# Configure the trainer
trainer = Seq2SeqTrainer(args=training_args, model=model, data_collator=data_collator,
    train_dataset=dataset["train"], eval_dataset=dataset["test"], tokenizer=processor)