
Text-guided image editing

1. Text-guided image editing

Let's now look at text-guided image editing.

2. Diffusers

Diffusion models are trained to produce images from random noise, and they've had lots of success in image generation. When coupled with image-text similarity models, such as CLIP, they provide a way to use text prompts to guide image generation and editing.

3. Custom image editing

In addition to a text prompt, the ControlNet approach uses an additional annotation image that acts as a starting point for generation. Different models work with different kinds of annotation images. For example, we can create an image of object edges using a Canny edge detector. Let's give this a go!

4. Custom image editing

We'll load an image of the Mona Lisa using the diffusers load_image() function. We'll use the Canny() function from the cv2 computer vision package to create the annotation image for the ControlNet; it takes an image array and identifies edges using intensity thresholds of 100 and 200. We then concatenate three identical copies of the Canny image, one for each color channel

5. Custom image editing

and cast the array back to an image.
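
As a concrete sketch of these steps (the image URL is a placeholder, since the exact path isn't shown in the course):

import cv2
import numpy as np
from PIL import Image
from diffusers.utils import load_image

# Load the base image (placeholder URL; any RGB image works)
image = load_image("https://example.com/mona_lisa.png")

# Detect edges with lower and upper intensity thresholds of 100 and 200
canny = cv2.Canny(np.array(image), 100, 200)

# Stack the single-channel edge map into three identical color channels
canny = np.concatenate([canny[:, :, None]] * 3, axis=2)

# Cast the array back to a PIL image for use as the ControlNet annotation
canny_image = Image.fromarray(canny)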

6. Custom image editing

To use a ControlNet, we use the ControlNetModel class, choosing a model designed to work with Canny annotation images. We also specify 16-bit float precision. We pass the controlnet to the StableDiffusionControlNetPipeline, which loads an appropriate Stable Diffusion model. Again, we specify 16-bit float precision and send the pipeline to the GPU with .to("cuda").
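
A minimal sketch of this setup follows; the checkpoint names (a Canny ControlNet and a Stable Diffusion 1.5 base model) are typical choices and may differ from those used in the course:

import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# ControlNet trained on Canny annotation images, in 16-bit float precision
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)

# Pair the ControlNet with a Stable Diffusion checkpoint and send the pipeline to the GPU
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")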

7. Custom image editing

With the prompt, we specify the edits, requesting an image of Einstein, and letting the Canny image provide the starting point. The prompt specifies keywords, and doesn't need to be perfect English. A generator is defined with a seed for reproducibility. The pipeline takes the prompt text, the Canny image, and a negative prompt, in addition to the generator. The negative prompt provides the model with a contrasting prompt that guides the editing. Finally, the number of inference steps controls the quality of the output image. There we have it - Einstein in the Mona Lisa pose!
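
Putting it together, a hedged sketch of the call might look like this; the prompt wording, seed, negative prompt, and step count are illustrative rather than the course's exact values:

# Keyword-style prompt; a fixed seed makes the result reproducible
prompt = "portrait of Albert Einstein, best quality, extremely detailed"
generator = torch.Generator(device="cuda").manual_seed(1)

output = pipe(
    prompt,
    image=canny_image,                      # Canny annotation as the starting point
    negative_prompt="low quality, blurry",  # contrasting prompt to guide the edit
    generator=generator,
    num_inference_steps=20,                 # more steps generally means higher quality
)
edited_image = output.images[0]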

8. Image inpainting

Image inpainting provides a method to generate new content localized to a certain region defined by a binary mask.

9. Image inpainting

Masks can be the result of another model or defined by the user. They can also be created using online tools or with open source packages, like InpaintingMask-Generation on GitHub. Here, we make a mask to cover the Mona Lisa's head and shoulders.
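
As a simple sketch, a rectangular mask can be built with NumPy and PIL, reusing the image loaded earlier; the coordinates below are purely illustrative, not the exact region used in the course:

import numpy as np
from PIL import Image

# White (255) pixels mark the region to regenerate; black (0) pixels are kept
width, height = image.size
mask = np.zeros((height, width), dtype=np.uint8)
mask[40:300, 100:400] = 255  # roughly the head-and-shoulders region

mask_image = Image.fromarray(mask)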

10. Image inpainting

We can use the StableDiffusionControlNetInpaintPipeline in association with a ControlNetModel specialized for inpainting. Setting use_safetensors=True directs the model to use the secure safetensors format. Like StableDiffusionControlNetPipeline, StableDiffusionControlNetInpaintPipeline uses a standard Stable Diffusion checkpoint and a specialized ControlNet model.
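
A minimal sketch, assuming a ControlNet inpainting checkpoint and a Stable Diffusion 1.5 base model (the course may use different checkpoints):

from diffusers import ControlNetModel, StableDiffusionControlNetInpaintPipeline

# ControlNet specialized for inpainting, loaded from the secure safetensors format
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/control_v11p_sd15_inpaint",
    torch_dtype=torch.float16,
    use_safetensors=True,
)

# Standard Stable Diffusion checkpoint paired with the inpainting ControlNet
pipe = StableDiffusionControlNetInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
    use_safetensors=True,
).to("cuda")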

11. Image inpainting

To run inference with the pipeline, a control image is required. Our control image will be the original image with the masked section removed, and we can define a function to create it. We convert both images to NumPy arrays and scale them to between 0 and 1. The pixel intensities of the control image are set to -1 wherever the mask is greater than 0.5 (the pixels to replace), and the array dimensions are rearranged into the shape the model expects using NumPy's expand_dims() and transpose() operations.
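
A sketch of such a function, following the pattern just described (the function name is only a convention):

import numpy as np
import torch

def make_inpaint_condition(image, image_mask):
    # Convert both images to float arrays scaled to the range 0-1
    image = np.array(image.convert("RGB")).astype(np.float32) / 255.0
    image_mask = np.array(image_mask.convert("L")).astype(np.float32) / 255.0

    # Mark the pixels to replace (where the mask is > 0.5) with -1
    image[image_mask > 0.5] = -1.0

    # Add a batch dimension and move channels first: (1, 3, height, width)
    image = np.expand_dims(image, 0).transpose(0, 3, 1, 2)
    return torch.from_numpy(image)

control_image = make_inpaint_condition(image, mask_image)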

12. Image inpainting

We can then use the pipeline to request a smile, specifying the number of inference steps, setting the eta noise parameter, and providing our base image, mask, and control image. Accessing the result with the .images attribute, we see the Mona Lisa now has a smile!
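
A hedged sketch of this final call; the prompt text, step count, and eta value are illustrative:

output = pipe(
    "smiling woman, best quality",
    num_inference_steps=20,
    eta=1.0,                       # noise parameter used by the scheduler
    generator=generator,
    image=image,                   # original image
    mask_image=mask_image,         # binary mask of the region to regenerate
    control_image=control_image,   # masked control image from make_inpaint_condition()
)
output.images[0]                   # the edited Mona Lisa, now smiling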

13. Let's practice!

Let's edit images with prompts!