1. Linear Support Vector Machines
In this chapter, we'll introduce the simplest support vector classifier, one in which the decision boundary is a straight line. We'll use the dataset we generated in the previous chapter, which has a linear decision boundary by construction.
2. Split into training and test sets
The dataset is in the dataframe df. The first task is to split it into training and test sets, which we do by assigning rows randomly to the two sets in an 80/20 proportion. This is what the code shown in the slide achieves: it sets the seed, performs the random 80/20 split, and creates separate dataframes for the training and test sets.
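A minimal sketch of one common way to do this, assuming the data are in df as described (the particular seed value here is arbitrary and may differ from the slide's):

    set.seed(10)                                      # fix the seed for reproducibility
    df$train <- ifelse(runif(nrow(df)) < 0.8, 1, 0)   # flag roughly 80% of rows for training
    trainset <- df[df$train == 1, ]                   # training set
    testset  <- df[df$train == 0, ]                   # test set
    trainset$train <- NULL                            # drop the helper column
    testset$train  <- NULL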
3. Decision boundaries and kernels
Note that in SVM classifiers, decision boundaries can be of different types: straight lines, polynomials, or even more complicated functions. The type of decision boundary is called a kernel and has to be specified upfront. We will say more about kernels as we work our way through the course. For now, just note that we'll use linear kernels in this chapter as we know our decision boundary is a straight line.
4. SVM with linear kernel
In this course, we will use the svm() function from the e1071 library. The function has a number of parameters, and we'll set the following explicitly: 1) formula, which specifies the dependent and independent variables; 2) data, the dataframe containing the data, the trainset dataframe in our case; 3) type, which refers to the type of algorithm: since ours is a classification problem, we set this to C-classification (there is another type of classification algorithm called nu-classification, which we will not cover in this course); 4) kernel, which we set to linear, as we know our decision boundary is a straight line; 5) cost and gamma, which are tuning parameters that we'll leave at their default values for now; and 6) scale, a Boolean variable indicating whether the data should be scaled, which we set to FALSE to enable plotting of the classifier against the original, unscaled data. In most real-life situations you would set scale to TRUE.
5. Building a linear SVM
OK, so we load the e1071 library and invoke the svm() function, specifying the parameters mentioned earlier. The result is assigned to the variable svm_model, which we now examine.
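A sketch of such a call, assuming the class labels in trainset are in a column named y (adjust the formula to your column names); because cost and gamma are omitted, their default values apply:

    library(e1071)

    svm_model <- svm(y ~ .,                     # formula: y predicted from all other columns
                     data = trainset,           # the training dataframe
                     type = "C-classification",
                     kernel = "linear",         # other options: "polynomial", "radial", "sigmoid"
                     scale = FALSE)             # keep the original units for plotting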
6. Overview of model
Typing in the name of the variable containing the model gives an overview of the model, including the SVM type, the kernel, and the values of the tuning parameters cost and gamma, which, as you'll recall, we left at their defaults. We now see that these default values are 1 and 0-point-5, respectively. We also see that the model has a fairly large number of support vectors, 55 in all. In the next lesson, we'll talk about what support vectors are and why they are called support vectors. But before we do that, let's explore the contents of our model a bit further.
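To see this overview yourself, print or summarize the model; note that the default gamma of 0-point-5 is consistent with e1071's rule of one over the number of input dimensions, two in our case (gamma is in fact only used by the nonlinear kernels):

    svm_model             # prints the SVM type, kernel, cost, gamma, and number of support vectors
    summary(svm_model)    # additionally shows how the support vectors split across the classes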
7. Exploring the model
The model object has several components worth knowing. The first, index, lists the indices of the support vectors in the training set. SV contains the support vector coordinates; rho is the negative y-intercept of the decision boundary; and coefs contains the weighting coefficients of the support vectors. The magnitude of a coefficient indicates the importance of the corresponding support vector, and its sign indicates which side of the boundary the support vector lies on.
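A minimal way to inspect these components (the names below are as stored in an e1071 svm object):

    svm_model$index       # row numbers of the support vectors within trainset
    head(svm_model$SV)    # coordinates of the support vectors
    svm_model$rho         # negative y-intercept of the decision boundary
    svm_model$coefs       # weighting coefficients of the support vectors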
8. Model accuracy
Finally, we obtain class predictions for the training and test sets and use these to calculate accuracy. The accuracies are perfect, which is no surprise since the dataset is linearly separable. However, as we will see in the next lesson, accuracy by itself can be misleading.
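One way to compute these accuracies, again assuming the class labels are in a column named y as in the earlier sketches:

    pred_train <- predict(svm_model, trainset)   # predicted classes for the training set
    mean(pred_train == trainset$y)               # training accuracy

    pred_test <- predict(svm_model, testset)     # predicted classes for the test set
    mean(pred_test == testset$y)                 # test accuracy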
9. Time to practice!
But before that, let's do a few exercises.