Generating new speech
Time to complete your mastery of using Hugging Face audio models! You'll use a fine-tuned model to generate new speech for a given voice. You will choose a voice from the VCTK Corpus as the basis for the new audio.
The dataset (`dataset`) and `SpeechT5ForTextToSpeech` model (`model`) have already been loaded, and a `make_spectrogram()` function has been provided to aid with plotting.
This exercise is part of the course Multi-Modal Models with Hugging Face.
Exercise instructions
- Load a sample speaker embedding from index `5` of the test `dataset`.
- Generate the speech from the processed text by specifying the `inputs`, `speaker_embedding`, and `vocoder`.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
text = "Hi, welcome to your new voice."
# Load a speaker embedding from the dataset
speaker_embedding = torch.tensor(dataset[5]["____"]).unsqueeze(0)
inputs = processor(text=text, return_tensors="pt")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")
# Generate speech
speech = model.generate_speech(____["input_ids"], ____, ____=____)
make_spectrogram(speech)