Creating speech embeddings
Time to encode an audio array into a speaker embedding! Speaker embeddings capture the vocal characteristics of a given speaker and are essential for personalizing generated audio to that speaker's voice.
The pretrained spkrec-xvect-voxceleb model (speaker_model) and the VCTK dataset (dataset) have been loaded for you.
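Outside the exercise environment, these objects could be set up roughly as follows. This is a sketch, not part of the exercise: the dataset identifier ("vctk") and the 16 kHz resampling step are assumptions, and the EncoderClassifier import path lives under speechbrain.pretrained in older SpeechBrain releases.

import torch
from datasets import Audio, load_dataset
from speechbrain.inference.speaker import EncoderClassifier

# Load the pretrained x-vector speaker-recognition model from SpeechBrain
speaker_model = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-xvect-voxceleb",
    savedir="pretrained_models/spkrec-xvect-voxceleb",
)

# Assumption: VCTK is available under this id on the Hugging Face Hub.
# The x-vector model expects 16 kHz audio, so resample on load.
dataset = load_dataset("vctk", split="train")
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))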
Exercise instructions
- Complete the create_speaker_embedding() function definition by calculating the raw embedding from the waveform using the speaker_model.
- Extract the audio array from the data point at index 10 of the dataset.
- Calculate a speaker embedding from the audio array using the create_speaker_embedding() function.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
def create_speaker_embedding(waveform):
    with torch.no_grad():
        # Calculate the raw embedding from the speaker_model
        speaker_embeddings = ____.____(torch.tensor(____))
        speaker_embeddings = torch.nn.functional.normalize(speaker_embeddings, dim=2)
        speaker_embeddings = speaker_embeddings.squeeze().cpu().numpy()
    return speaker_embeddings
# Extract the audio array from the dataset
audio_array = dataset[10]["____"]["____"]
# Calculate the speaker embedding from the data point
speaker_embedding = ____(____)
print(speaker_embedding.shape)
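For reference, here is one possible completion. It assumes speaker_model is a SpeechBrain EncoderClassifier, whose encode_batch method returns a batch of x-vector embeddings, and that each dataset row stores its audio under the standard Hugging Face "audio" feature with the raw samples in an "array" key.

def create_speaker_embedding(waveform):
    with torch.no_grad():
        # encode_batch returns a (batch, 1, 512) tensor of x-vectors
        speaker_embeddings = speaker_model.encode_batch(torch.tensor(waveform))
        # L2-normalize along the embedding dimension
        speaker_embeddings = torch.nn.functional.normalize(speaker_embeddings, dim=2)
        speaker_embeddings = speaker_embeddings.squeeze().cpu().numpy()
    return speaker_embeddings

# The audio feature is a dict holding the raw samples under "array"
audio_array = dataset[10]["audio"]["array"]

speaker_embedding = create_speaker_embedding(audio_array)
print(speaker_embedding.shape)  # (512,) for this x-vector model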