
Generating new speech

Time to complete your mastery of Hugging Face audio models! You'll use a fine-tuned model to generate new speech for a given voice, choosing a voice from the VCTK Corpus as the basis for the new audio.

The dataset and a SpeechT5ForTextToSpeech model (model) have already been loaded, and a make_spectrogram() function has been provided to help with plotting.
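The course environment handles this setup for you. As a rough sketch of what the preloading might look like (the checkpoint names and the make_spectrogram() implementation shown here are assumptions, and the speaker-embedding dataset is loaded for you under the name dataset, so it is omitted):

import torch
import matplotlib.pyplot as plt
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan

# Assumed checkpoint: SpeechT5 fine-tuned for text-to-speech
processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")

def make_spectrogram(speech, sampling_rate=16000):
    # Plot a spectrogram of the generated waveform (SpeechT5's vocoder outputs 16 kHz audio)
    plt.specgram(speech.numpy(), Fs=sampling_rate)
    plt.xlabel("Time (s)")
    plt.ylabel("Frequency (Hz)")
    plt.show()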

This exercise is part of the course Multi-Modal Models with Hugging Face.

Exercise instructions

  • Load a sample speaker embedding from index 5 of the test dataset.
  • Generate the speech from the processed text by specifying the inputs, speaker_embedding, and vocoder.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

text = "Hi, welcome to your new voice."

# Load a speaker embedding from the dataset
speaker_embedding = torch.tensor(dataset[5]["____"]).unsqueeze(0)

inputs = processor(text=text, return_tensors="pt")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

# Generate speech
speech = model.generate_speech(____["input_ids"], ____, ____=____)

make_spectrogram(speech)
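For reference, here is one way the blanks could be completed, assuming the speaker embeddings are stored under an xvector field (a common convention for x-vector datasets paired with SpeechT5; the field name in the course's dataset may differ):

text = "Hi, welcome to your new voice."

# Load a speaker embedding from index 5 of the test dataset
# ("xvector" is an assumed field name; adjust it to match the loaded dataset)
speaker_embedding = torch.tensor(dataset[5]["xvector"]).unsqueeze(0)

inputs = processor(text=text, return_tensors="pt")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

# Generate speech from the tokenized text, the chosen voice, and the vocoder
speech = model.generate_speech(inputs["input_ids"], speaker_embedding, vocoder=vocoder)

make_spectrogram(speech)

Passing the vocoder makes generate_speech() return a waveform rather than a mel spectrogram, which is what make_spectrogram() then plots.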