
Multi-class logistic regression

1. Multi-class logistic regression

Multi-class classification means having more than 2 classes. While we've used scikit-learn to perform multi-class classification, all of our conceptual discussions have been in the binary, or 2-class, case. In this video, we'll discuss how multi-class classification works for linear classifiers.

2. Combining binary classifiers with one-vs-rest

We'll cover two popular approaches to multi-class classification. The first is to train a series of binary classifiers, one per class. For example, I've loaded the wine dataset and instantiated 3 logistic regression classifiers. I'll now fit these classifiers on the same features but 3 different binary targets. The code y==0 returns an array the same size as y that's True where y is 0 and False otherwise, so the classifier learns to predict these true/false values. In other words, it's a binary classifier learning to discriminate between class 0 and not 0. The next one learns y==1 vs. not 1, and so on. This is called the one-vs-rest strategy. To make predictions using one-vs-rest, we take the class whose classifier gives the largest raw model output - or decision_function, in scikit-learn terminology. In this case, the largest raw model output comes from classifier 0. This means it's more confident that the class is 0 than any of the other classes, so we predict class 0. We can also just let scikit-learn do the work by fitting a logistic regression model on the original multi-class data set, setting the multi_class parameter to "ovr". We get the same prediction of 0, as expected.
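Here's a minimal sketch of that workflow, assuming the wine dataset from sklearn.datasets (13 features, 3 classes); max_iter is raised to avoid convergence warnings on the unscaled features, and the multi_class argument is assumed to still be accepted by your scikit-learn version.

from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression

X, y = load_wine(return_X_y=True)

# One binary classifier per class: each learns "this class vs. the rest".
lr0 = LogisticRegression(max_iter=5000).fit(X, y == 0)
lr1 = LogisticRegression(max_iter=5000).fit(X, y == 1)
lr2 = LogisticRegression(max_iter=5000).fit(X, y == 2)

# Predict by taking the class whose classifier gives the largest
# raw model output (decision_function) for a given example.
print(lr0.decision_function(X[:1]))
print(lr1.decision_function(X[:1]))
print(lr2.decision_function(X[:1]))

# Or let scikit-learn do the work on the original multi-class targets.
lr_ovr = LogisticRegression(multi_class="ovr", max_iter=5000).fit(X, y)
print(lr_ovr.predict(X[:1]))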

3. One-vs-rest vs. multinomial/softmax

Another way to achieve multi-class classification with logistic regression is to modify the loss function so that it directly tries to optimize accuracy on the multi-class problem. You may encounter various terms for this, such as multinomial logistic regression, softmax, or cross-entropy loss. The slide shows a comparison of the two approaches. In one case you fit a separate classifier for each class, whereas in the other you fit just once. The same goes for prediction. An appealing property of the one-vs-rest approach is that you can reuse your binary classifier implementation rather than needing a new one. On the other hand, you might sometimes get better accuracy with the multinomial classifier, since its loss is more directly aligned with accuracy. In the field of neural networks, the multinomial approach is standard. Finally, while both approaches can work for SVMs, one-vs-rest and related strategies tend to be more popular there. By the way, both of these methods can output probabilities, just like a binary classifier.
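As a rough illustration of that last point, here's a sketch comparing the two strategies on the same data (same assumptions as the sketch above):

from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression

X, y = load_wine(return_X_y=True)

lr_ovr = LogisticRegression(multi_class="ovr", max_iter=5000).fit(X, y)
lr_mn = LogisticRegression(multi_class="multinomial", max_iter=5000).fit(X, y)

# Both output per-class probabilities; each row sums to 1 over the 3 classes.
print(lr_ovr.predict_proba(X[:1]))
print(lr_mn.predict_proba(X[:1]))

For one-vs-rest, scikit-learn normalizes the three binary probabilities so they sum to 1; the multinomial model produces them directly via the softmax function.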

4. Model coefficients for multi-class

We've talked a lot about coefficients, so it's natural to ask: what do the coefficients look like for multi-class classification? Continuing with the wine dataset, let's fit a one-vs-rest model and look at the coefficients. In the binary case we had one coefficient per feature and one intercept. For 3 classes we now have 3 entire binary classifiers, so we end up with one coefficient per feature per class, and one intercept per class. Hence, the coefficients of this model are stored in a 3-by-13 array. We can instantiate the multinomial version by setting the multi_class argument to "multinomial", which is also the default for non-binary classification. As we can see, the multinomial classifier has the same number of coefficients and intercepts as one-vs-rest. Although these two approaches work differently, they learn the same number of parameters and, roughly speaking, the parameters have the same interpretations.
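A quick sketch to check the shapes described above (same assumptions as before):

from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression

X, y = load_wine(return_X_y=True)

lr_ovr = LogisticRegression(multi_class="ovr", max_iter=5000).fit(X, y)
lr_mn = LogisticRegression(multi_class="multinomial", max_iter=5000).fit(X, y)

# One coefficient per feature per class, and one intercept per class.
print(lr_ovr.coef_.shape, lr_ovr.intercept_.shape)  # (3, 13) (3,)
print(lr_mn.coef_.shape, lr_mn.intercept_.shape)    # (3, 13) (3,)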

5. Let's practice!

Your turn to explore these two multi-class approaches.