1. Multiclass problems
In this lesson we're going to take a detour and learn how the SVM algorithm deals with classification problems that have more than two classes. We'll use the ubiquitous iris dataset, first introduced by Sir Ronald Fisher in 1936. The dataset is almost linearly separable and thus gives us an opportunity to apply what we've learned so far to a real-world dataset.
2. The iris dataset - an introduction
The dataset consists of 150 observations of iris plants, each described by five attributes. Four attributes are numerical: petal width, petal length, sepal width, and sepal length. The fifth attribute, species, is categorical and can take on one of three values: setosa, virginica, and versicolor. The dataset is available for download at the UCI Machine Learning Repository.
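If you are following along in R, the same data also ships with base R as the built-in iris data frame, so you can take a first look without downloading anything:

# iris is built into base R (the datasets package)
data(iris)
str(iris)              # 150 obs. of 5 variables: four measurements plus Species
table(iris$Species)    # 50 observations of each species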
3. Visualizing the iris dataset
Let's get a feel for the data by plotting it as a function of petal length and petal width. We can do this easily using ggplot().
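Here is a minimal sketch of that plot, assuming the ggplot2 package is installed; the aesthetic choices are mine and the course slide may style the plot differently:

# Scatterplot of petal length vs. petal width, colored by species
library(ggplot2)
ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color = Species)) +
  geom_point()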
4. Plot of dataset
On this plane we see a clear linear boundary between setosa and the other two species, versicolor and virginica. The boundary between the latter two is almost linear. Since there are four predictors, one would have to plot the other pairwise combinations to get a better feel for the data. I'll leave this as an exercise for you and move on under the assumption that the data is nearly linearly separable. If that assumption is grossly incorrect, a linear SVM will not work well.
5. How does the SVM algorithm deal with multiclass problems?
SVMs are essentially binary classifiers, so how can we apply them to datasets that have more than two classes, such as the iris dataset? It turns out that there's a simple and quite general voting strategy for doing this. Here's how it works. We first partition the dataset into subsets containing two classes each. In the case of the iris dataset we would get three subsets, one for each possible binary combination: setosa/versicolor, setosa/virginica, and versicolor/virginica. The three binary classification problems are then solved separately. After that, each data point is assigned the majority prediction, with ties broken by a random equiprobable selection. This method is called the one-against-one classification strategy and can be applied to a variety of binary classifiers, not just SVMs. The nice thing, as we will see next, is that the e1071 svm() function does all of this work for us automatically.
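To make the strategy concrete, here is a rough, hand-rolled sketch of one-against-one voting on iris. It is purely illustrative and the variable names are my own; svm() performs these steps internally, so you never need to write this yourself:

# Build the three pairwise subsets and fit a binary linear SVM on each;
# every binary model then votes on every observation.
library(e1071)
species_pairs <- combn(levels(iris$Species), 2, simplify = FALSE)
votes <- sapply(species_pairs, function(p) {
  subset_p <- droplevels(iris[iris$Species %in% p, ])
  model <- svm(Species ~ ., data = subset_p, kernel = "linear")
  as.character(predict(model, iris))
})

# Majority vote across the three binary classifiers, ties broken at random
majority <- apply(votes, 1, function(v) {
  counts <- table(v)
  winners <- names(counts)[counts == max(counts)]
  if (length(winners) == 1) winners else sample(winners, 1)
})
mean(majority == iris$Species)   # agreement of the voted predictions with the labels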
6. Building a multiclass linear SVM
Let's build a linear SVM for the iris dataset. In case you want to reproduce the calculation, note that I've partitioned the dataset into training and test sets using an 80/20 split in the usual way and have set the seed integer to 10. You'll notice that there is absolutely no difference between the code for the multiclass and binary cases. The algorithm generates the binary subsets, builds the classification models, and takes the majority prediction for each data point automatically, without any additional user input.
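Here is a sketch of that workflow. The exact split code from the course may differ slightly, so treat the 80/20 partition below as one plausible version of "the usual way"; your accuracy may not match 96% exactly if your split differs.

# 80/20 random split, reproducible with seed 10
library(e1071)
set.seed(10)
in_train <- runif(nrow(iris)) < 0.8
trainset <- iris[in_train, ]
testset  <- iris[!in_train, ]

# Multiclass linear SVM: identical code to the binary case
svm_model <- svm(Species ~ ., data = trainset,
                 type = "C-classification", kernel = "linear")

# Test accuracy
pred_test <- predict(svm_model, testset)
mean(pred_test == testset$Species)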
As far as the accuracy is concerned, things look pretty good. We get a test accuracy of 96%, which indicates that the dataset is indeed almost linearly separable. Before closing, I should mention a point that I have glossed over so far in this course: to get a robust measure of model performance, one should average the accuracy over a range of distinct training and test sets. We'll do this in the final exercise for this lesson.
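As a preview of that idea, here is a minimal sketch of averaging test accuracy over many random splits; the final exercise may structure it differently:

# Repeat the split/train/evaluate cycle and average the test accuracies
accuracies <- replicate(100, {
  in_train <- runif(nrow(iris)) < 0.8
  trainset <- iris[in_train, ]
  testset  <- iris[!in_train, ]
  model <- svm(Species ~ ., data = trainset,
               type = "C-classification", kernel = "linear")
  mean(predict(model, testset) == testset$Species)
})
mean(accuracies)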
7. Time to practice!
Well, that's it for this chapter, in which we developed an intuition for how linear SVMs work, how their margins can be tuned using the cost parameter, and how the algorithm handles multiclass problems. In the next chapter we'll explore more complex SVMs. But first, some exercises.