
Evaluation metrics for text classification

1. Evaluation metrics for text classification

Let's evaluate our text classification models.

2. Why evaluation metrics matter

Picture this: Our model, designed to assess the sentiment of book reviews, suggests that a best-seller has mostly negative reviews. Should we accept its judgment? We can use evaluation metrics to answer this.

3. Evaluating RNN models

Before evaluating, we must generate predictions from the model. First, we pass the test dataset through the model to obtain an output score for each class. Next, we store the predictions in the predicted variable using the torch-dot-max function, which returns the indices of the maximum values along the specified dimension, indicated by the argument one. We'll use the predicted variable for the evaluation metrics.
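As a rough sketch, assuming a trained model and a test_loader that yields batches of inputs and labels (these names are illustrative, not taken from the course code), the prediction step might look like this:

```python
import torch

# Assumed names: `model` is the trained classifier, `test_loader` yields
# (inputs, labels) batches from the test dataset.
model.eval()
all_predicted, all_actual = [], []

with torch.no_grad():
    for inputs, labels in test_loader:
        outputs = model(inputs)               # one score per class for each sample
        _, predicted = torch.max(outputs, 1)  # index of the highest score along dimension 1
        all_predicted.append(predicted)
        all_actual.append(labels)

predicted = torch.cat(all_predicted)  # used by the evaluation metrics below
actual = torch.cat(all_actual)
```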

4. Accuracy

The most straightforward metric is accuracy, the ratio of correct predictions to total predictions. Using torchmetrics, the tensor actual holds our actual labels, and predicted holds the model's predictions. We want to determine whether an instance belongs to class zero or class one, a binary classification. The Accuracy class is initialized with a binary task and num_classes set to two for our two categories. The task can also be multiclass if there are more than two categories to classify. Passing the predicted and actual labels to the accuracy instance gives the model's accuracy score. A score of zero-point-66 indicates the model predicted just over 66 percent of the samples correctly. What counts as a good score varies with the complexity of the problem. Scores range from zero to one, with higher scores representing greater accuracy; for example, zero-point-75 may be reasonable for sentiment analysis but poor elsewhere. As we learn more about metrics, we'll see that accuracy alone doesn't capture everything.
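Here is a minimal, self-contained sketch. The actual and predicted tensors are illustrative values chosen so the score lands near the zero-point-66 quoted above; they are not the course's data:

```python
import torch
from torchmetrics import Accuracy

# Illustrative labels (not the course data): 1 = positive, 0 = negative.
actual = torch.tensor([1, 1, 1, 1, 0, 0, 0, 0, 0])
predicted = torch.tensor([1, 1, 0, 0, 1, 0, 0, 0, 0])

# Binary task with two categories; use task="multiclass" for more than two.
accuracy = Accuracy(task="binary", num_classes=2)
print(accuracy(predicted, actual))  # tensor(0.6667) -- 6 of 9 samples correct
```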

5. Beyond accuracy

Imagine a dataset of 10,000 book reviews where 9,800 readers adore the book and 200 found faults. Let's assume our model predicts all instances as positive, making it 98 percent accurate! But look closer: such a model can't identify a single negative review. Enter precision, which asks how many of the reviews the model labels negative truly are negative. Recall checks how many of the actual negative reviews the model spots. The F1 score harmonizes these two, ensuring neither is neglected. If we trusted accuracy alone, we'd miss significant feedback. Let's explore each in more detail.
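To make the point concrete, here is a hypothetical sketch of that imbalanced scenario; the all-positive "model" is simulated directly rather than trained:

```python
import torch
from torchmetrics import Accuracy, Recall

# Hypothetical imbalanced labels: 9,800 positive reviews, 200 negative.
actual = torch.cat([torch.ones(9800), torch.zeros(200)]).int()
# A naive "model" that labels every review positive.
predicted = torch.ones(10000).int()

print(Accuracy(task="binary")(predicted, actual))  # tensor(0.9800) -- looks impressive

# Recall with the negative class treated as the class of interest
# (labels flipped so 1 means "negative"): the model finds none of them.
print(Recall(task="binary")(1 - predicted, 1 - actual))  # tensor(0.)
```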

6. Precision and Recall

Precision is the ratio of correctly predicted positive observations to the total predicted positives. Recall is the ratio of correctly predicted positive observations to all observations in the actual positive class. To calculate these, we import the Precision and Recall classes from torchmetrics, use the same parameters as before, and print the results.
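A minimal sketch with the same illustrative tensors as before (assumed values, not the course data):

```python
import torch
from torchmetrics import Precision, Recall

actual = torch.tensor([1, 1, 1, 1, 0, 0, 0, 0, 0])
predicted = torch.tensor([1, 1, 0, 0, 1, 0, 0, 0, 0])

# Same parameters as the Accuracy metric above.
precision = Precision(task="binary", num_classes=2)
recall = Recall(task="binary", num_classes=2)

print(precision(predicted, actual))  # tensor(0.6667) -- 2 of 3 positive predictions are correct
print(recall(predicted, actual))     # tensor(0.5000) -- 2 of 4 actual positives are found
```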

7. Precision and Recall

A precision of zero-point-six-six suggests that out of all positive predictions, just over 66 percent were accurate. Meanwhile, a recall of zero-point-five signifies the model captured 50 percent of all genuine positives. Like accuracy, the scores range from zero to one. The complexity of the problem needs to be considered when defining a score as good or bad.

8. F1 score

The F1 score harmonizes precision and recall and is especially useful when dealing with imbalanced classes. To calculate it, we import the F1Score class from torchmetrics and instantiate it with the same parameters. An F1 score of one indicates perfect precision and recall, while a score of zero indicates the worst possible performance. Here, an F1 score of zero-point-57 suggests a reasonably balanced trade-off between precision and recall, though what counts as a good trade-off depends on the task.
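Continuing the same illustrative example:

```python
import torch
from torchmetrics import F1Score

actual = torch.tensor([1, 1, 1, 1, 0, 0, 0, 0, 0])
predicted = torch.tensor([1, 1, 0, 0, 1, 0, 0, 0, 0])

# Same parameters as the other metrics.
f1 = F1Score(task="binary", num_classes=2)
print(f1(predicted, actual))  # tensor(0.5714) -- harmonic mean of 0.6667 and 0.5
```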

9. Considerations

In some instances, such as with multi-class classification, we may find that all scores are identical. This generally reflects how the metrics are averaged across classes rather than guaranteeing the model is performing well. So remember to always consider the problem when interpreting results!

10. Let's practice!

Well done! Time for some evaluation practice.