
Creating speech embeddings

Time to encode an audio array into a speaker embedding! Speaker embeddings capture the vocal characteristics of a given speaker, and are essential for personalizing generated audio to that speaker's voice.

The pretrained spkrec-xvect-voxceleb model (speaker_model) and VCTK dataset (dataset) have been loaded for you.

This exercise is part of the course

Multi-Modal Models with Hugging Face


Exercise instructions

  • Complete the create_speaker_embedding() function definition by calculating the raw embedding from the waveform using the speaker_model.
  • Extract the audio array from the data point at index 10 of the dataset.
  • Calculate a speaker embedding from the audio array using the create_speaker_embedding() function.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

def create_speaker_embedding(waveform):
    with torch.no_grad():
        # Calculate the raw embedding from the speaker_model
        speaker_embeddings = ____.____(torch.tensor(____))
        
        speaker_embeddings = torch.nn.functional.normalize(speaker_embeddings, dim=2)
        speaker_embeddings = speaker_embeddings.squeeze().cpu().numpy()
    return speaker_embeddings

# Extract the audio array from the dataset
audio_array = dataset[10]["____"]["____"]

# Calculate the speaker_embedding from the datapoint
speaker_embedding = ____(____)
print(speaker_embedding.shape)
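For reference, the completed function can be sketched as below. Since the real speaker_model (SpeechBrain's spkrec-xvect-voxceleb, whose encode_batch() returns x-vectors of shape (batch, 1, 512)) and the VCTK dataset require downloads, this sketch substitutes a stand-in module with the same interface and a fake audio array so it runs offline; the stand-in class and the random waveform are assumptions, not part of the exercise environment.

```python
import numpy as np
import torch

class StandInSpeakerModel(torch.nn.Module):
    """Stand-in for the pretrained x-vector model (an assumption for this
    sketch, not the real spkrec-xvect-voxceleb weights)."""
    def __init__(self, embedding_dim=512):
        super().__init__()
        self.proj = torch.nn.Linear(1, embedding_dim)

    def encode_batch(self, waveform):
        # Mean-pool the waveform over time, then project to embedding_dim,
        # mimicking SpeechBrain's (batch, 1, embedding_dim) output shape.
        pooled = waveform.mean(dim=-1, keepdim=True)  # (batch, 1)
        return self.proj(pooled).unsqueeze(1)         # (batch, 1, 512)

speaker_model = StandInSpeakerModel()

def create_speaker_embedding(waveform):
    with torch.no_grad():
        # Calculate the raw embedding from the speaker_model
        speaker_embeddings = speaker_model.encode_batch(torch.tensor(waveform))
        # L2-normalize along the embedding dimension
        speaker_embeddings = torch.nn.functional.normalize(speaker_embeddings, dim=2)
        # Drop the singleton batch dims and convert to a NumPy array
        speaker_embeddings = speaker_embeddings.squeeze().cpu().numpy()
    return speaker_embeddings

# Stand-in for dataset[10]["audio"]["array"]: one second of fake
# mono audio at 16 kHz, as a batch of one
audio_array = np.random.randn(1, 16000).astype(np.float32)

# Calculate the speaker embedding from the data point
speaker_embedding = create_speaker_embedding(audio_array)
print(speaker_embedding.shape)  # (512,)
```

With the real model and dataset loaded, the blanks resolve the same way: the raw embedding comes from speaker_model.encode_batch(torch.tensor(waveform)), and the audio array comes from dataset[10]["audio"]["array"].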