Fine-tuning a text-to-speech model
You will be working with the VCTK Corpus, which includes around 44 hours of speech data uttered by English speakers with various accents, to fine-tune a text-to-speech model to replicate regional accents.
The dataset has already been loaded and preprocessed, and the SpeechT5ForTextToSpeech module has been loaded, as have the Seq2SeqTrainingArguments and Seq2SeqTrainer modules. A data collator (data_collator) has been predefined.
Please do not call the .train() method on the trainer, as this code will time out in this environment.
This exercise is part of the course
Multi-Modal Models with Hugging Face
Exercise instructions
- Load the microsoft/speecht5_tts pretrained model using SpeechT5ForTextToSpeech.
- Create an instance of Seq2SeqTrainingArguments with: gradient_accumulation_steps set to 8, learning_rate set to 0.00001, warmup_steps set to 500, and max_steps set to 4000.
- Configure the trainer with the new training arguments, and the model, data, and processor provided.
Interactive exercise
Complete the sample code to finish this exercise.
# Load the text-to-speech pretrained model
model = ____.____(____)
# Configure the required training arguments
training_args = ____(
    output_dir="speecht5_finetuned_vctk_test",
    gradient_accumulation_steps=____,
    learning_rate=____,
    warmup_steps=____,
    max_steps=4000,
    label_names=["labels"],
    push_to_hub=False,
)

# Configure the trainer
trainer = ____(
    args=training_args,
    model=model,
    data_collator=data_collator,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=processor,
)
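For reference, a completed version of the scaffold, following the instructions above, might look like the sketch below. It assumes the exercise environment has already defined dataset, processor, and data_collator (as stated earlier), so it is not runnable in isolation, and .train() is deliberately not called.

```python
from transformers import (
    SpeechT5ForTextToSpeech,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
)

# Load the text-to-speech pretrained model
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")

# Configure the required training arguments per the instructions
training_args = Seq2SeqTrainingArguments(
    output_dir="speecht5_finetuned_vctk_test",
    gradient_accumulation_steps=8,  # accumulate gradients over 8 steps
    learning_rate=0.00001,
    warmup_steps=500,
    max_steps=4000,
    label_names=["labels"],
    push_to_hub=False,
)

# Configure the trainer; `dataset`, `data_collator`, and `processor`
# are assumed to be predefined in this environment
trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    data_collator=data_collator,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=processor,
)
```

Note that gradient_accumulation_steps=8 multiplies the effective batch size by eight without increasing per-device memory use, which is why it pairs well with the small learning rate and warmup schedule requested here.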