1. Evaluate models with Accelerator
Now, let's evaluate models using Accelerator!
2. Why put a model in evaluation mode?
Let's compare evaluation mode with training mode. In training, the model uses layers like dropout, which randomly sets some neurons to zero to prevent overfitting, and batch normalization, which normalizes activations using the statistics of each batch.
3. Why put a model in evaluation mode?
Evaluation mode switches these layers to their inference behavior: dropout is turned off and batch normalization uses its stored running statistics, ensuring our model makes consistent predictions for our sentiment analysis application. We call model.eval() to activate this mode.
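As a rough sketch, here is the difference the mode makes for a tiny, hypothetical PyTorch model (not our sentiment model) that contains a dropout layer:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in model: one dropout layer is enough to see the difference
model = nn.Sequential(
    nn.Linear(10, 16),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # randomly zeroes half the activations in training mode
    nn.Linear(16, 2),
)
x = torch.randn(1, 10)

model.train()            # training mode: two forward passes on the same input can differ
print(model(x), model(x))

model.eval()             # evaluation mode: dropout is off, outputs are consistent
print(model(x), model(x))
```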
4. Disable gradients with torch.no_grad()
Another difference between training and evaluation is gradient computations. In training, errors propagate backward from the output layer through the hidden layers to update the weights. During evaluation, we disable gradients using torch.no_grad() to save memory and improve run times. Thus, during evaluation, we use both model.eval() and torch.no_grad() together.
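A minimal sketch of the combined pattern, again using a small hypothetical stand-in model:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)          # hypothetical stand-in for the fine-tuned model

model.eval()                      # consistent layer behavior (dropout off, etc.)
with torch.no_grad():             # no gradient tracking: lower memory, faster
    inputs = torch.randn(4, 10)   # hypothetical batch of four examples
    logits = model(inputs)
    predictions = logits.argmax(dim=-1)
```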
5. Prepare a validation dataset
Let's consider the validation steps from data preparation to model evaluation. We'll load a validation dataset, specifying the dataset name. "glue" refers to a collection of natural language tasks, and MRPC is one of the tasks in the benchmark. It consists of sentence pairs and labels indicating whether they are paraphrases. Then, we tokenize the dataset using the same encode function from training.
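A sketch of those steps using the datasets and transformers libraries; the checkpoint name and the padding choices inside encode are illustrative assumptions:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Hypothetical checkpoint; in practice, use the same tokenizer as in training
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def encode(examples):
    # Tokenize the sentence pairs, truncating and padding to a fixed length
    return tokenizer(examples["sentence1"], examples["sentence2"],
                     truncation=True, padding="max_length")

# MRPC validation split from the GLUE benchmark: sentence pairs plus paraphrase labels
validation_dataset = load_dataset("glue", "mrpc", split="validation")
validation_dataset = validation_dataset.map(encode, batched=True)
```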
6. Life of an epoch: training and evaluation loops
After preparing the dataset, we'll iterate over the train and validation datasets for each epoch. Within an epoch, we enable training mode with model.train(). Then we loop through the validation set in evaluation mode using model.eval(). Finally, metrics are logged outside the evaluation loop for each epoch. Combining training and evaluation loops in this way helps track performance across epochs.
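Here is a sketch of that epoch structure; it assumes model, optimizer, train_dataloader, eval_dataloader, and accelerator have already been set up and passed through accelerator.prepare():

```python
import torch

num_epochs = 3                                 # hypothetical

for epoch in range(num_epochs):
    model.train()                              # training mode: dropout and batch norm active
    for batch in train_dataloader:
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)             # Accelerator handles the backward pass
        optimizer.step()
        optimizer.zero_grad()

    model.eval()                               # switch to evaluation behavior
    for batch in eval_dataloader:
        with torch.no_grad():
            outputs = model(**batch)
        # ...gather predictions and update metrics here (see the next slide)...

    # Metrics are computed and logged here, once per epoch, outside the evaluation loop
```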
7. Inside the evaluation loop
The evaluation loop starts by loading evaluation metrics specific to the dataset using evaluate.load() to retrieve accuracy and F1 score. We disable gradient computations when making predictions for each batch by using "with torch.no_grad()". Predictions and labels from all devices are then collected using accelerator.gather_for_metrics(), akin to compiling survey results from a community. We add each batch's predictions and labels to the metric using metric.add_batch(). Finally, we call the .compute() method to retrieve the metric values, which gives us the accuracy and F1 score after one epoch.
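Put together, the evaluation loop might look like the following sketch, assuming model, eval_dataloader, and accelerator come from the setup above and that the label column is named "labels":

```python
import torch
import evaluate

# Load the metrics associated with the GLUE MRPC task (accuracy and F1)
metric = evaluate.load("glue", "mrpc")

model.eval()
for batch in eval_dataloader:
    with torch.no_grad():
        outputs = model(**batch)
    predictions = outputs.logits.argmax(dim=-1)
    # Collect predictions and labels from all devices
    predictions, references = accelerator.gather_for_metrics(
        (predictions, batch["labels"])
    )
    metric.add_batch(predictions=predictions, references=references)

results = metric.compute()   # e.g. {"accuracy": ..., "f1": ...} after one epoch
```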
8. Log metrics after evaluation
After evaluation, we log metrics. Accelerator tracks metrics during distributed training by integrating with tools like TensorBoard and MLflow. Here we focus on where logging fits into the training workflow, rather than how to use the tracking tools (see documentation on tools like TensorBoard for installation instructions). We initialize Accelerator by specifying the project directory for saving results and setting log_with to "all" to detect installed tracking tools. Then we call init_trackers(), specifying a project name, to initialize those tools. Inside the epoch loop, we call accelerator.log() after the evaluation loop, passing it a dictionary of metrics to log and specifying step as the current epoch. Finally, we call accelerator.end_training() to notify the tracking tools that training has finished. The tracking tools log the metrics, so we can monitor performance during training.
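A sketch of where these calls sit, with a hypothetical project name and placeholder metric values standing in for the results of metric.compute():

```python
from accelerate import Accelerator

# Point Accelerator at a results directory and let it detect installed trackers
accelerator = Accelerator(project_dir="./results", log_with="all")
accelerator.init_trackers("sentiment-analysis")    # hypothetical project name

num_epochs = 3                                     # hypothetical
for epoch in range(num_epochs):
    # ...training loop, then evaluation loop producing accuracy and F1...
    results = {"accuracy": 0.0, "f1": 0.0}         # placeholder for metric.compute()
    accelerator.log(results, step=epoch)           # log this epoch's metrics

accelerator.end_training()                         # notify trackers that training finished
```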
9. Let's practice!
Now it's your turn!