
Preprocessing different modalities

1. Preprocessing different modalities

Welcome back! Let's learn about some of the preprocessing steps required to use text, image, and audio data with Hugging Face models.

2. Preprocessing text

Let's start with text preprocessing. To transform raw text into something that models can understand, we use a tokenizer.

3. Preprocessing text

The preprocessing starts with normalizing text by lowercasing it, removing special characters, and handling whitespace.

4. Preprocessing text

Then, we pre-tokenize the text - breaking it down into individual words or subwords. Special tokens are often used to denote the beginning and end of the text.

5. Preprocessing text

The tokens are then encoded using a model vocabulary into token IDs.

6. Preprocessing text

Finally, we add padding tokens, which are commonly zeros, to the end of the sequence. This standardization is essential because many models require input sequences to have the same length. Let's try this out using Hugging Face.

7. Preprocessing text

We start by importing the AutoTokenizer class from the transformers library, loading DistilBERT's tokenizer with the .from_pretrained() method, and defining the text string to preprocess. We can access the normalizer through the tokenizer's .backend_tokenizer.normalizer attribute. Here, we normalize our text using .normalize_str(), which lowercases the string and replaces the accented "e" with a regular "e". The tokenizer we loaded can take care of the entire preprocessing pipeline with a single line of code. The return_tensors argument, set to "pt" for PyTorch, returns tensors that can be passed directly to a downstream PyTorch model. Additionally, we can enable padding with the padding argument, although no padding tokens are added when processing a single string. There we have it!
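A minimal sketch of that walkthrough, assuming the "distilbert-base-uncased" checkpoint and an example sentence chosen for illustration:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
text = "Héllo, how are you?"

# Normalization: lowercase the string and strip accents ("é" becomes "e")
normalized = tokenizer.backend_tokenizer.normalizer.normalize_str(text)
print(normalized)  # hello, how are you?

# The full pipeline in one line: tokenize, encode, and return PyTorch tensors.
# padding=True has no effect on a single string but pads a batch to equal length.
inputs = tokenizer(text, return_tensors="pt", padding=True)
print(inputs["input_ids"])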

8. Preprocessing images

Moving from text to images, preprocessing starts with normalizing pixel intensities - typically scaling with the mean and standard deviation of the dataset - and resizing the images. It's important that these transformations match those used to train the model, which can often be found in the model card. Let's give this a try!
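One way to check those values is to load the checkpoint's image processor and inspect its configuration. A minimal sketch, assuming the "Salesforce/blip-image-captioning-base" checkpoint used on the next slide:

from transformers import AutoImageProcessor

image_processor = AutoImageProcessor.from_pretrained("Salesforce/blip-image-captioning-base")

# Resize target and normalization statistics expected by the model
print(image_processor.size)        # target height and width
print(image_processor.image_mean)  # per-channel means used for scaling
print(image_processor.image_std)   # per-channel standard deviations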

9. Preprocessing images

We first load the BlipForConditionalGeneration image captioning model, along with its corresponding BlipProcessor. Both are loaded from the same checkpoint to ensure consistency. The workflow involves encoding the image, generating the encoded caption, and then decoding the output into readable text. We pass the image to the processor, which performs all of the necessary preprocessing in a single line. Then, we unpack the preprocessed inputs with a double-asterisk and call model.generate() to create the caption. The generated caption consists of token IDs, so we decode it with the processor to get a text output.
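A sketch of this captioning workflow, assuming the "Salesforce/blip-image-captioning-base" checkpoint and a hypothetical local image file:

from PIL import Image
from transformers import BlipForConditionalGeneration, BlipProcessor

checkpoint = "Salesforce/blip-image-captioning-base"
model = BlipForConditionalGeneration.from_pretrained(checkpoint)
processor = BlipProcessor.from_pretrained(checkpoint)

image = Image.open("example.jpg")  # hypothetical image file

# Preprocess: resize, normalize, and convert to PyTorch tensors
inputs = processor(images=image, return_tensors="pt")

# Generate the caption as token IDs, then decode them into text
output_ids = model.generate(**inputs)
caption = processor.decode(output_ids[0], skip_special_tokens=True)
print(caption)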

10. Preprocessing audio

Audio preprocessing involves three steps. First, we convert the raw audio waveform into a sequential array and apply filtering or padding to handle varying lengths, similar to text padding. Then, we ensure the correct sampling rate through resampling. Finally, we perform feature extraction, converting the audio into the spectrograms that most Hugging Face audio models expect.

11. Preprocessing audio

The Audio class from the Hugging Face datasets library performs resampling on audio arrays from datasets. This is used in combination with the .cast_column() method, where the required sampling rate is provided to the Audio class. We'll preprocess this dataset using OpenAI's Whisper-small processor, which handles all the necessary audio transformations. We need to make sure to match the sampling rate of the original training data, which can be found in the model card.
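A sketch of these steps, assuming the "openai/whisper-small" checkpoint; the example dataset name and split are assumptions:

from datasets import Audio, load_dataset
from transformers import WhisperProcessor

# Whisper-small expects 16 kHz audio, as noted in its model card
processor = WhisperProcessor.from_pretrained("openai/whisper-small")

dataset = load_dataset("PolyAI/minds14", name="en-US", split="train")

# Resample every audio example to the required sampling rate
dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))

# Feature extraction: convert one resampled example into log-Mel spectrogram features
example = dataset[0]["audio"]
inputs = processor(example["array"], sampling_rate=example["sampling_rate"], return_tensors="pt")
print(inputs["input_features"].shape)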

12. Let's practice!

Let's practice!
