1. Kernel SVMs
In this video we'll see how to fit nonlinear boundaries using linear classifiers.
2. Transforming your features
Consider this 2D toy dataset. The two classes are not linearly separable; in other words, there's no linear boundary that perfectly classifies all the points. If you try fitting a linear SVM on these points, you might get back something that predicts blue everywhere.
3. Transforming your features
However, notice that the red points are all close to the point (0,0) in the coordinate system.
Let's create two new features, one of which is feature 1 squared and the other of which is feature 2 squared. That way, values near zero become small, and values far from zero, whether positive or negative, become large.
What happens now if we plot the points?
4. Transforming your features
Well, now they are linearly separable in this transformed universe, because all the red points are near the lower left and all the blue points are above and to the right. We can fit a linear SVM using these new features and
5. Transforming your features
the result is a perfect classification. But then we might ask ourselves: what does this linear boundary look like back in the original space? In other words, if we took these axes and un-squared them, what would happen to the shape of the boundary?
6. Transforming your features
In this case, we get an ellipse. So, what's the take-home message here? It's that fitting a linear model in a transformed space corresponds to fitting a nonlinear model in the original space. Nice!
In general, the transformation isn't always going to be squaring and the boundary isn't always going to be an ellipse. In fact, the new space often has a different number of dimensions from the original space! But this is the basic idea. Kernels and kernel SVMs implement feature transformations in a computationally efficient way.
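To make this concrete, here's a rough sketch of the squaring idea in scikit-learn. The toy data below is made up for illustration (it isn't the exact dataset from the slides), but the pattern is the same: square the features, then fit a linear SVM in the transformed space.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Made-up 2D toy data: one class clusters around (0, 0), the other surrounds it
rng = np.random.RandomState(0)
X = rng.uniform(-2, 2, size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1).astype(int)

# Transform: feature 1 squared and feature 2 squared
X_squared = X ** 2

# A linear boundary in the squared space corresponds to an ellipse in the original space
clf = LinearSVC()
clf.fit(X_squared, y)
print(clf.score(X_squared, y))  # at or near 1.0 on the transformed features
```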
7. Kernel SVMs
Let's look at some code. We'll need to use scikit-learn's SVC class, rather than LinearSVC, to allow for different kernels.
The default behavior is what's called an RBF or Radial Basis Function kernel. Although it's not computed this way, you can think of this as an extremely complicated transformation of the features, followed by fitting a linear boundary in that new space, just like we saw for the simpler squaring transformation. While many nonlinear kernels exist, in this course we'll focus on the RBF kernel.
With kernel SVMs, we can call fit and predict in all the usual ways. Let's look at a decision boundary.
This is definitely not linear! And, as a result, we've gotten a higher training accuracy than we could have with a linear boundary. We can control the shape of the boundary using the hyperparameters. As usual, we have the C hyperparameter, which controls regularization. The RBF kernel also introduces a new hyperparameter, gamma, which controls the smoothness of the boundary. Decreasing gamma makes the boundary smoother.
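As a sketch of what that code might look like (assuming X and y hold the training features and labels, as in the earlier snippet):

```python
from sklearn.svm import SVC

# SVC uses the RBF kernel by default; C and gamma shape the boundary
svm = SVC(kernel="rbf", C=1.0, gamma="scale")
svm.fit(X, y)

print(svm.predict(X))   # predictions, just like any other classifier
print(svm.score(X, y))  # training accuracy with the nonlinear boundary
```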
8. Kernel SVMs
The second image shows the same dataset with gamma set to 0.01. The boundary looks smoother.
9. Kernel SVMs
The third image shows gamma=2. Now we've reached 100% training accuracy by creating a little "island" of blue around each blue training example. In fact, with the right hyperparameters, RBF SVMs are capable of perfectly separating almost any dataset.
So, why not always use the largest value of gamma and get the highest possible training accuracy? You guessed it: overfitting. In the exercises you'll explore how the kernel hyperparameters affect the tradeoff between training and test accuracy.
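As a rough sketch of that kind of experiment (reusing the made-up X and y from the earlier snippet):

```python
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Larger gamma tends to raise training accuracy but can hurt test accuracy
for gamma in [0.01, 1, 2, 100]:
    svm = SVC(kernel="rbf", gamma=gamma).fit(X_train, y_train)
    print(gamma, svm.score(X_train, y_train), svm.score(X_test, y_test))
```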
10. Let's practice!
Time to experiment with kernel SVMs.