1. Preprocess images and audio for training
Distributed training begins with preprocessing to standardize inputs and with splitting the data across devices to speed up training.
2. Preparing images and audio
We're creating an app to enhance accessibility for people with low vision by identifying objects in images and responding to voice commands like "Turn down the volume." During preprocessing, we'll use Accelerator for data sharding, distributing data across devices to enable parallel processing.
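As a minimal sketch of the setup we'll build toward, the Accelerator object from the accelerate library is instantiated once and reused throughout (the exact usage of prepare() comes later in this lesson):

```python
from accelerate import Accelerator

# Instantiating Accelerator detects the available devices (CPUs/GPUs);
# later, accelerator.prepare() will shard DataLoaders across them
accelerator = Accelerator()
```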
3. Manipulating a sample image dataset
Let's consider an example image dataset. Printing it shows the structure consisting of images and labels. To access the first image, we display the dataset at index zero, specifying the image feature. The output is a JPEG image file of size 720 by 480.
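As a hypothetical sketch (the dataset name is an assumption, not from the slides), inspecting an image dataset might look like:

```python
from datasets import load_dataset

# "beans" is a stand-in image-classification dataset with image/labels features
dataset = load_dataset("beans", split="train")

print(dataset)              # Dataset({features: ['image', 'labels'], num_rows: ...})
print(dataset[0]["image"])  # a PIL JpegImageFile with its size, e.g. 720x480
```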
4. Standardize the image format
Models expect images to have a certain width and height, and pixel values to have a certain mean and standard deviation. During preprocessing, we adjust the image format accordingly. Then we call AutoImageProcessor.from_pretrained() to load the processing steps defined for a pretrained image model.
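For instance, loading the processing steps for a vision checkpoint (the checkpoint name is an assumption) looks like this:

```python
from transformers import AutoImageProcessor

# Checkpoint name is an assumption; the processor carries the model's
# expected image size, mean, and standard deviation
image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
```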
5. Standardize the image format
We apply a lambda function to the dataset to create a pixel_values feature by standardizing images using the image_processor. Displaying the dataset shows the new pixel_values feature.
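A minimal sketch of that step, assuming the dataset and image_processor from above:

```python
# Standardize each image and store the result as a new pixel_values feature
dataset = dataset.map(
    lambda row: {
        "pixel_values": image_processor(
            row["image"], return_tensors="pt"
        )["pixel_values"][0]
    }
)

print(dataset)  # features now include pixel_values
```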
6. Manipulating a sample audio dataset
Let's move to audio preprocessing. First, we print an example dataset to see its structure. This audio dataset is a nested dictionary with train, validation, and test splits. Within the train split, we can access audio (which are digitized arrays of voice commands) and labels (which are text commands).
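Sketching this inspection with a stand-in voice-command dataset (the dataset name and config are assumptions; "superb" with the "ks" keyword-spotting config has this train/validation/test structure):

```python
from datasets import load_dataset

# Keyword-spotting data: short spoken commands plus labels
audio_dataset = load_dataset("superb", "ks")

print(audio_dataset)                       # DatasetDict with train/validation/test splits
print(audio_dataset["train"][0]["audio"])  # dict with 'array' and 'sampling_rate'
```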
7. Standardize the audio format
Next, let's standardize the audio format. Models expect audio to have a certain number of samples, determined by two properties: the sampling rate is the number of samples per second (for example, 16kHz), and the max duration is the number of seconds of audio (for example, 1s). So the max number of samples, or max length, is the sampling rate times max duration, or 16000 samples.
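In code, this is a single multiplication:

```python
sampling_rate = 16_000                          # samples per second (16 kHz)
max_duration = 1.0                              # seconds of audio to keep
max_length = int(sampling_rate * max_duration)  # 16000 samples
```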
8. Standardize the audio format
Now, we load a feature_extractor to standardize the audio for a model. Defining a preprocess_function, we extract audio_arrays, apply the feature_extractor to the arrays, specify the sampling_rate and max_length, and truncate the array if it is longer than max_length.
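A sketch of that function, assuming a wav2vec2-style checkpoint (the checkpoint name is an assumption, and padding shorter clips to max_length is added here so batches stack cleanly):

```python
from transformers import AutoFeatureExtractor

feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")

def preprocess_function(examples):
    # Pull the raw waveforms out of the nested audio dicts
    audio_arrays = [audio["array"] for audio in examples["audio"]]
    return feature_extractor(
        audio_arrays,
        sampling_rate=16_000,
        max_length=16_000,      # sampling_rate * max_duration
        truncation=True,        # cut off arrays longer than max_length
        padding="max_length",   # assumption: pad shorter clips to the same length
    )
```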
9. Apply the preprocessing function
Next, we map the preprocess_function to the dataset. "remove_columns" will remove the audio and file columns after applying the preprocess_function, and "batched" specifies to process examples in batches.
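Applying it, following the arguments described above:

```python
encoded_dataset = audio_dataset.map(
    preprocess_function,
    remove_columns=["audio", "file"],  # drop the raw columns afterwards
    batched=True,                      # pass examples to the function in batches
)
```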
10. Apply the preprocessing function
Displaying the dataset shows the updated structure with the standardized input_values.
11. Prepare data for distributed training
After preprocessing, we prepare the data for loading and iterating during training by creating a DataLoader, which batches the data and shuffles the images or audio. For distributed training, we pass the DataLoader to the accelerator.prepare() method, which places the data tensors on the available CPUs or GPUs so that each device processes its own subset of the data in parallel; this is data sharding. prepare() works with PyTorch DataLoaders, which have type torch.utils.data.DataLoader.
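A minimal sketch under the assumptions above (the batch size and split name are illustrative):

```python
from torch.utils.data import DataLoader
from accelerate import Accelerator

# Batch and shuffle the preprocessed training split
train_dataloader = DataLoader(
    encoded_dataset["train"].with_format("torch"),
    batch_size=8,
    shuffle=True,
)

accelerator = Accelerator()
# prepare() places batches on the available devices so each one
# processes its own shard of the data in parallel
train_dataloader = accelerator.prepare(train_dataloader)
```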
12. Let's practice!
Your turn to preprocess images and audio!