Model design and data collection

1. Model design and data collection

Welcome back! Let's explore the process of developing a generative AI model, starting with model design and data collection.

2. Know how to fill the tank

Imagine someone trying to drive a car without ever having seen inside one. Seeing a gauge pointing to E, they wouldn't realize that the gas tank is empty and needs refueling. Even for a driver who never builds cars, understanding a little about how they work helps. The same goes for generative AI. We don't all need to create generative AI models. For many use cases, an existing model or AI product might do the job and be cost-effective. Whether or not we train models ourselves, having some idea of how they work will help us make the best use of them. Let's dive in.

3. Developing a model

Developing new generative AI models involves four key steps: research and design to decide on a model architecture; training data collection and preparation; model training; and finally model evaluation. We'll cover the first two steps in this video, and go deeper into model training and evaluation later on.

4. Stable Diffusion's research and development

Let's take Stable Diffusion, an image generation model released in 2022, as an example. It was one of the first models that could take any text prompt and return unique, aesthetically pleasing images in response. Its developers' research and development process included the following. First, they defined their core purpose and use cases. They decided to make an image generation tool because they believed it was a good way to advance their mission of accessible AI that inspires creativity. Then, a team of dozens of researchers devised an architecture. They settled on a diffusion model, a type of generative model that creates images starting from static, or random noise. Finally, they established a general idea of the resources required for building the model. Ultimately, training required hundreds of GPUs running in the cloud for 150,000 hours, which at the time cost about $600,000.
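To get a feel for those resource numbers, here is a minimal back-of-the-envelope sketch. It assumes the 150,000 hours are total GPU-hours summed across all GPUs, which the video does not state explicitly. The GPU count of 256 is a hypothetical stand-in for "hundreds of GPUs", chosen only for illustration.

```python
# Figures quoted in the video.
total_gpu_hours = 150_000   # assumed to be summed across all GPUs
total_cost_usd = 600_000

# Implied price of one GPU running for one hour.
cost_per_gpu_hour = total_cost_usd / total_gpu_hours

# Hypothetical cluster size standing in for "hundreds of GPUs".
assumed_gpu_count = 256

# If all GPUs ran in parallel the whole time, this is the elapsed time.
wall_clock_hours = total_gpu_hours / assumed_gpu_count
wall_clock_days = wall_clock_hours / 24

print(f"Cost per GPU-hour: ${cost_per_gpu_hour:.2f}")
print(f"Elapsed time on {assumed_gpu_count} GPUs: {wall_clock_days:.1f} days")
```

Under these assumptions the training run works out to about $4 per GPU-hour and a little over three weeks of continuous computing. A different cluster size changes the elapsed time but not the total cost.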

5. Data collection: not your typical ML model

Generative AI models require massive amounts of data for training because they are learning to generate new data. This differs from discriminative models, which classify existing data. How much data are we talking about? Well, Stable Diffusion required 2 billion images, or 100,000 gigabytes (about 100 terabytes) of training data. The data also needs to be diverse, so that it can represent the domain. Here are just a few images of blue cats in the dataset, including photographs and different styles of cartoons. Just like with other types of machine learning, data must be preprocessed before training: cleaned to improve its quality and converted into a format the model can accept. Stable Diffusion's developers needed to adjust the sizes and other characteristics of those 2 billion images so that the model could learn from them.
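The dataset figures and the preprocessing step can be made concrete with a small sketch. The helper names below are hypothetical, and the specific steps (a square center crop and scaling pixel values to the range -1 to 1) are common image preprocessing choices used here for illustration, not a description of the actual Stable Diffusion pipeline.

```python
def average_image_bytes(total_gigabytes, image_count):
    """Average storage per image, using 1 GB = 1,000,000,000 bytes."""
    return total_gigabytes * 1_000_000_000 / image_count


def center_crop_box(width, height):
    """Return (left, top, right, bottom) of the largest centered square."""
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return (left, top, left + side, top + side)


def scale_pixel(value):
    """Map an 8-bit pixel value (0 to 255) to the range -1.0 to 1.0."""
    return value / 127.5 - 1.0


# Figures quoted in the video: 100,000 GB spread over 2 billion images.
print(average_image_bytes(100_000, 2_000_000_000))  # bytes per image

# A hypothetical 640x480 photo cropped to a square before resizing.
print(center_crop_box(640, 480))

# The darkest and brightest pixel values after scaling.
print(scale_pixel(0), scale_pixel(255))
```

The first calculation shows that the quoted figures imply an average of roughly 50 kilobytes per image, which suggests small, compressed files. Cropping every image to the same shape and scaling pixel values to a fixed range are the kind of adjustments that let a model treat billions of differently sized images uniformly.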

6. Data collection: privacy and security are critical

It's also worth remembering that privacy is critical during data collection, as very large datasets tend to include user-generated content that contains personally identifiable information (PII). In many cases, developers must anonymize or aggregate data to remove individual-level details, for instance by blurring out faces in datasets of security camera footage. In addition, security measures should be in place to prevent unauthorized access or misuse of the data. Sensitive data should be stored in a way that limits and monitors all access. If developers fail to take precautions during data collection, the models they train may be subject to copyright and ownership concerns. We'll explore this more in a later video.
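Blurring faces requires computer vision tooling, but the same idea of anonymization can be sketched for text, such as image captions that contain contact details. The example below is a minimal illustration using Python's standard `re` module. The function name, the placeholder tokens, and the two patterns are assumptions made for this sketch, and simple patterns like these would miss many real forms of PII.

```python
import re

# Simplified patterns for illustration only; real PII detection needs far more.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_PATTERN = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def redact_pii(text):
    """Replace email addresses and simple phone numbers with placeholders."""
    text = EMAIL_PATTERN.sub("[EMAIL]", text)
    text = PHONE_PATTERN.sub("[PHONE]", text)
    return text


caption = "Blue cat painting, contact jane.doe@example.com or 555-123-4567"
print(redact_pii(caption))
```

The caption keeps its descriptive content, which is what a model would learn from, while the individual-level details are replaced before the data is stored or used for training.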

7. Let's practice!

Time for some exercises to see what we've learned.