
Multimodal video generation

1. Multimodal video generation

Let's make the final leap from generating images to generating videos!

2. Video generation

In order to generate video, multiple steps are needed, each of which can be seen as an independent model: text-to-image key frame generation, image-to-image frame interpolation, and image-to-image upscaling, known as super-resolution. The end result is a seamless video generated from a text prompt.

3. Video generation

We'll use the CogVideoX model from THUDM, a diffusion model that incorporates additional model components for image-to-image tasks such as interpolation. This is needed to produce a consistent sequence of images without jumps between video frames. We use the CogVideoXPipeline from diffusers and the .from_pretrained() method, specifying the checkpoint and floating point precision. Diffusion pipelines are often very large, which is why CPU offloading has been enabled. This keeps model components in CPU memory and moves each one to the GPU only when it is needed, rather than loading the whole model into GPU memory at once. The .enable_slicing() and .enable_tiling() options break the decoding step into smaller chunks for more memory-efficient processing.
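As a rough sketch of this setup (assuming the THUDM/CogVideoX-2b checkpoint and half precision; the course's exact checkpoint and dtype may differ), the pipeline can be loaded like this:

```python
import torch
from diffusers import CogVideoXPipeline

# Load the pretrained pipeline in half precision to reduce memory use
# ("THUDM/CogVideoX-2b" is an illustrative checkpoint choice)
pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-2b",
    torch_dtype=torch.float16,
)

# Keep model components on the CPU and move each to the GPU only when needed
pipe.enable_model_cpu_offload()

# Break VAE decoding into slices/tiles for more memory-efficient processing
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()
```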

4. Video generation

Let's examine how to generate videos using text prompts. Here, we create a vivid scene of a lion in the savanna. The pipe we defined takes our prompt, the number of inference steps, which controls quality, the desired frame count for our video, and a guidance scale that balances creativity with prompt adherence. The guidance scale typically ranges from 1 to 20, where values near 1 largely ignore the prompt and values near 20 enforce strict adherence, potentially at the cost of quality. We also set a custom seed for reproducibility using a torch generator. The output will be a sequence of frames showing our lion scene coming to life, accessed using the .frames attribute.
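A minimal generation call might look like the following (the prompt text, step count, frame count, guidance scale, and seed are illustrative values, not necessarily the course's exact ones):

```python
import torch

prompt = "A lion strolling across a golden savanna at sunset, cinematic lighting"

video_frames = pipe(
    prompt=prompt,
    num_inference_steps=50,   # more steps -> higher quality, slower generation
    num_frames=49,            # length of the generated clip
    guidance_scale=6,         # ~1 = ignore prompt, ~20 = strict adherence
    generator=torch.Generator(device="cuda").manual_seed(42),  # reproducibility
).frames[0]
```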

5. Video generation

Now let's save our generated frames as a video and convert it to a shareable GIF format. We'll use export_to_video from diffusers.utils to export the frames to an MP4 video file with a given frame rate, specified with fps. Then, we use moviepy to convert the MP4 to a GIF, passing the file path to VideoFileClip and calling the .write_gif() method. How cool is that!?
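A sketch of this export step (file names and frame rate are placeholders; the import path assumes moviepy 1.x):

```python
from diffusers.utils import export_to_video
from moviepy.editor import VideoFileClip  # in moviepy 2.x: from moviepy import VideoFileClip

# Write the generated frames to an MP4 at 8 frames per second
export_to_video(video_frames, "lion.mp4", fps=8)

# Convert the MP4 into a shareable GIF
clip = VideoFileClip("lion.mp4")
clip.write_gif("lion.gif")
```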

6. Quantitative analysis

One challenge in video generation is ensuring the output stays faithful to the original prompt throughout all frames. CLIP offers a potential solution for measuring prompt adherence. We can compare each generated frame against the text prompt using the CLIP score. This allows us to quantitatively assess how well the video maintains consistency with the prompt, and identify any frames that drift too far away.

7. Quantitative analysis

To begin making CLIP score assessments of the video frames, we'll create a partial version of the clip_score() function using partial() from functools. This is purely for convenience: it lets us set the model checkpoint as a default argument of clip_score(), rather than specifying it every time we call it. Then, for each frame, we convert the array to an integer tensor and move the channel dimension to the first position using the .permute() method. After collecting a list of scores, we take the average. In this case, we see a reasonably good score of over 30.
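A sketch of this scoring loop, assuming torchmetrics' functional clip_score and the openai/clip-vit-base-patch16 checkpoint (the course's exact checkpoint may differ):

```python
from functools import partial

import numpy as np
import torch
from torchmetrics.functional.multimodal import clip_score

# Fix the CLIP checkpoint once so we don't pass it on every call
clip_score_fn = partial(clip_score, model_name_or_path="openai/clip-vit-base-patch16")

scores = []
for frame in video_frames:
    # Convert the frame (H, W, C array with values in 0-255) to an integer tensor
    frame_tensor = torch.from_numpy(np.array(frame)).to(torch.uint8)
    # Move the channel dimension to the front: (H, W, C) -> (C, H, W)
    frame_tensor = frame_tensor.permute(2, 0, 1)
    scores.append(clip_score_fn(frame_tensor, prompt).item())

print(f"Average CLIP score: {sum(scores) / len(scores):.2f}")
```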

8. Let's practice!

Now it's your turn to generate a video!