Pipeline tasks and evaluations

1. Pipeline tasks and evaluations

Let's begin to build model pipelines for different data modalities!

2. Pipelines vs. model components

Previously, we loaded individual processors and models from model checkpoints using separate classes and the .from_pretrained() method; for example, BlipProcessor and BlipForConditionalGeneration for caption generation. If we don't need to add custom transformations to the data, we can instead load the entire pipeline in a single line by specifying the task, for example image-to-text, and the model checkpoint.
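
As a minimal sketch, assuming the Salesforce/blip-image-captioning-base checkpoint (any image-to-text checkpoint would work here), the one-line pipeline version looks like this:

```python
from transformers import pipeline

# Assumed checkpoint; replace with the image-to-text checkpoint you want to use
pipe = pipeline(task="image-to-text",
                model="Salesforce/blip-image-captioning-base")
```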

3. Example comparison

Let's see how the two approaches compare with an image-to-text example. Recall that to generate a sample caption from the flickr30k dataset, we preprocessed the image and passed it explicitly to the model. With the pipeline we defined, we only need a single argument: the raw image from the dataset, and the encoding and decoding are done for us. The pipe returns the prediction in the form of a dictionary.
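
Here is a hedged sketch of the pipeline call, assuming the pipe defined above and a flickr30k-style dataset with an "image" column (the dataset ID and column name are assumptions):

```python
from datasets import load_dataset

# Assumed dataset ID and split; the course uses a flickr30k sample
dataset = load_dataset("nlphuji/flickr30k", split="test")
image = dataset[0]["image"]  # raw PIL image

# The pipeline handles preprocessing, generation, and decoding internally
prediction = pipe(image)
print(prediction)  # e.g. [{'generated_text': 'a caption describing the image'}]
```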

4. Finding models and tasks

We've already seen how to find models for a given task using the Hugging Face API. We can use the associated checkpoint ID, obtained with the .id attribute, to instantiate a pipeline. We can also see which tasks a model supports by looking at its model card. Here, we can see the available tasks for the BLIP captioning model we've been using.
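
A minimal sketch of this workflow, assuming we want the most-downloaded image-to-text model (the task name and sorting choice are assumptions):

```python
from huggingface_hub import HfApi
from transformers import pipeline

api = HfApi()

# List models for the task, sorted by downloads in descending order
models = api.list_models(task="image-to-text", sort="downloads",
                         direction=-1, limit=1)

# Take the checkpoint ID of the top result and build a pipeline from it
checkpoint = next(iter(models)).id
pipe = pipeline(task="image-to-text", model=checkpoint)
```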

5. Passing options to models

Although the pipeline simplifies processing and prediction, we can still pass additional arguments to the model or preprocessor. Consider Facebook's music generation model, which takes a text prompt and generates a corresponding waveform according to the description. We'll load the model as a pipeline using PyTorch tensors, indicated with "pt". Behind the scenes, this loads the model as an instance of the MusicgenForConditionalGeneration class. We can pass a dictionary altering the default temperature and max_new_tokens model parameters to the pipeline via the generate_kwargs argument. The temperature, ranging from 0 to 1, controls randomness and creativity, and max_new_tokens limits the number of tokens the model generates. Generating the music then needs only the prompt plus these additional arguments.
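
A hedged sketch of this setup, assuming the facebook/musicgen-small checkpoint, the text-to-audio pipeline task, and an illustrative prompt:

```python
from transformers import pipeline

# Assumed checkpoint; framework="pt" requests PyTorch tensors
music_pipe = pipeline(task="text-to-audio",
                      model="facebook/musicgen-small",
                      framework="pt")

# Lower temperature -> less random output; max_new_tokens caps the generated length
generate_kwargs = {"temperature": 0.8, "max_new_tokens": 256}

prompt = "a calm acoustic guitar melody"  # hypothetical prompt
output = music_pipe(prompt, generate_kwargs=generate_kwargs)
# output["audio"] holds the waveform, output["sampling_rate"] its sample rate
```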

6. Evaluating pipeline performance

Now that we can use pipelines, it is important to know how well they perform. The Hugging Face evaluate library provides functionality for calculating many common machine learning metrics, including accuracy, precision, recall, and F1-score. Accuracy is the overall proportion of correct classifications, precision measures how often the model's positive predictions are correct, and recall measures how many of the actual positive cases were correctly identified. F1-score is the harmonic mean of precision and recall. The evaluator class takes a supported task name as input, and we set up a dictionary of the metrics we want to calculate. We also define a label mapping, which maps class labels to the model's output IDs. This mapping is accessible from the pipeline via the .label2id attribute of pipe.model.config.
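
A minimal sketch of this setup, assuming an image-classification task (the task name is an assumption for the real-vs-AI image example that follows):

```python
import evaluate
from evaluate import evaluator

# Task evaluator for a supported task name
task_evaluator = evaluator("image-classification")

# Dictionary of the metrics we want to calculate
metric_dict = {
    "accuracy": evaluate.load("accuracy"),
    "precision": evaluate.load("precision"),
    "recall": evaluate.load("recall"),
    "f1": evaluate.load("f1"),
}

# Map class labels to the numeric IDs the model outputs
label_mapping = pipe.model.config.label2id
```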

7. Evaluating pipeline performance

Performing the evaluation requires the pipeline or model and the dataset to test it on; for example, a model that distinguishes real from AI-generated images and its associated dataset. To calculate multiple metrics at the same time, we pass the dictionary of metrics to the combine() function from evaluate. Finally, we add the mapping of the class labels, which is required to interpret the model outputs correctly. We can see our pipeline did a great job with this dataset!
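
A hedged sketch of the evaluation call, assuming pipe is an image-classification pipeline, dataset holds labeled real vs. AI-generated images, and metric_dict and label_mapping were defined as above (combine() also accepts a plain list of metric names):

```python
from evaluate import combine

results = task_evaluator.compute(
    model_or_pipeline=pipe,       # the pipeline under test
    data=dataset,                 # assumed labeled evaluation dataset
    metric=combine(metric_dict),  # compute all metrics in a single pass
    label_mapping=label_mapping,  # class label -> model output ID
)
print(results)
```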

8. Let's practice!

Let's practice working with pipelines!