
Assessing video generation performance

You can assess the performance of your video generation pipelines using a multi-modal CLIP model, which measures the similarity between each video frame and the prompt. You will use this to assess how well the video you generated in the previous exercise matches its prompt.
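Under the hood, the metric embeds each frame and the prompt with the same CLIP model and compares the two embeddings. In torchmetrics the score is defined as

CLIPScore(image, text) = max(100 * cosine_similarity(E_image, E_text), 0)

so values closer to 100 indicate a closer match between a frame and the prompt.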

The load_video() function has been imported from diffusers.utils for you, and the clip_score() metric has been imported from torchmetrics.functional.multimodal.
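For reference, the preloaded names correspond roughly to imports like these (a sketch; the exact setup of the exercise environment is an assumption):

from functools import partial

import numpy as np
import torch
from diffusers.utils import load_video
from torchmetrics.functional.multimodal import clip_score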

This exercise is part of the course Multi-Modal Models with Hugging Face.

Exercise instructions

  • Set up a CLIP scoring function called clip_score_fn() from the clip_score() metric.
  • Calculate the score for each frame and the prompt.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

frames = load_video(video_path)

# Set up CLIP scoring
clip_score_fn = partial(____, model_name_or_path="openai/clip-vit-base-patch32")

scores = []
for frame in frames:
  frame = np.array(frame)
  frame_int = (frame * 255).astype("uint8")
  frame_tensor = torch.from_numpy(frame_int).unsqueeze(0).permute(0, 3, 1, 2)
  
  # Calculate the score using the CLIP model
  score = ____(____, [____]).detach()
  scores.append(float(score))

avg_clip_score = round(np.mean(scores), 4)
print(f"Average CLIP score: {avg_clip_score}")