
Motivating the RBF kernel

1. Motivating the RBF kernel

In this lesson, we'll start by using a polynomial SVM to classify the complex dataset we generated in the previous lesson. We will see that the polynomial kernel doesn't work well, indicating that we need a more flexible kernel. We will then consider what flexibility means in the context of a classification problem, which gives us an intuitive motivation for the RBF kernel.

2. Quadratic kernel (default parameters)

The model-building process should now be familiar: we partition the data into training and test sets using the usual 80/20 split and then build an SVM with a polynomial kernel of degree 2, leaving the other parameters at their default values. Here is the code. The number of support vectors is a little over a third of the dataset, and the accuracy is actually not too bad. However, before we jump to conclusions, let's have a look at the plot.
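A minimal sketch of this step, assuming the data from the previous lesson sits in a data frame called df with predictors x1 and x2 and a factor class label y, and using the svm() function from the e1071 package (the object names are assumptions):

library(e1071)

# 80/20 train/test split (df, x1, x2, y are assumed names)
set.seed(1)
train_idx <- sample(nrow(df), 0.8 * nrow(df))
trainset <- df[train_idx, ]
testset <- df[-train_idx, ]

# degree-2 polynomial SVM, other parameters left at their defaults
svm_model <- svm(y ~ ., data = trainset, type = "C-classification",
                 kernel = "polynomial", degree = 2)

svm_model$tot.nSV                               # number of support vectors
mean(predict(svm_model, testset) == testset$y)  # test set accuracy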

3. Plot: quadratic kernel, default params

OK, so we see that this kernel does not do a very good job: it attempts to fit a circular boundary to what is actually a figure of 8. A quadratic, or second-degree, curve is simply not flexible enough to capture the complexity of this boundary. But perhaps we can do better by trying a higher-order polynomial.
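One way such a plot can be produced, sketched here assuming the trainset and svm_model objects from the previous sketch and predictors lying roughly in [0, 1] (the grid range is an assumption):

library(ggplot2)

# predict the class over a fine grid and shade the predicted regions
grid <- expand.grid(x1 = seq(0, 1, length.out = 200),
                    x2 = seq(0, 1, length.out = 200))
grid$pred <- predict(svm_model, grid)

ggplot() +
  geom_tile(data = grid, aes(x = x1, y = x2, fill = pred), alpha = 0.3) +
  geom_point(data = trainset, aes(x = x1, y = x2, color = y))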

4. Try higher degree polynomial

Let's try a higher-degree polynomial. We can rule out odd-degree polynomials because we know the decision boundary is symmetric about the x1 and x2 axes. On trying a polynomial kernel of degree 4, we see that the accuracy remains much the same as in the quadratic case. Let's see what the plot of the predicted decision boundary looks like.
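Only the degree argument changes from the earlier sketch:

# degree-4 polynomial SVM (same assumed objects as before)
svm_model4 <- svm(y ~ ., data = trainset, type = "C-classification",
                  kernel = "polynomial", degree = 4)
mean(predict(svm_model4, testset) == testset$y)  # accuracy: much the same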

5. Plot: polynomial kernel, degree 4

The plot looks much the same as in the quadratic case. If you try higher-degree polynomials, you will see that the story does not change much: polynomial kernels simply cannot capture the figure-of-8 shape well enough. Increasing the degree of the polynomial does not help, and neither does tuning (try it!). Clearly, another approach is required.

6. Another approach

OK, so instead of trying to choose a kernel that reproduces the boundary, let's try another approach. Let's use the heuristic that points that lie close to each other tend to belong to the same class. You might recognize that this is exactly the intuition behind the k-nearest neighbors algorithm. What would such a kernel look like? If we single out a point in the dataset, say X1 with coordinates (a, b), the kernel should have a maximum at X1 and should decrease in value as one moves away from it. Further, in the absence of any other information, the decrease should be isotropic, that is, the same in all directions. Furthermore, the decay rate, gamma, should be tunable. A simple function that has these properties is the exponential, or Gaussian, radial basis function e to the minus gamma times r, where r is the distance between X1 and any other point in the dataset X.
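As a one-line sketch (rbf is a hypothetical helper name):

# kernel value as a function of the distance r and the decay rate gamma
rbf <- function(r, gamma) exp(-gamma * r)

rbf(0, 1)  # 1: maximum influence at the centre X1
rbf(5, 1)  # near 0: influence fades with distance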

7. How does the RBF kernel vary with gamma (code)

How does the kernel vary with r and gamma? Here's some ggplot code to visualize the RBF kernel for r ranging from 0 to 10 for various values of gamma.
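A sketch of such code (the specific gamma values shown are assumptions):

library(ggplot2)

# evaluate e^(-gamma * r) for r in [0, 10] and several values of gamma
kernel_df <- expand.grid(r = seq(0, 10, length.out = 200),
                         gamma = c(0.2, 0.5, 1, 2))
kernel_df$kernel <- exp(-kernel_df$gamma * kernel_df$r)

ggplot(kernel_df, aes(x = r, y = kernel, color = factor(gamma))) +
  geom_line() +
  labs(y = "kernel value", color = "gamma")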

8. How does the RBF kernel vary with gamma (plot)

...and here's the plot. Recall that r is the distance between the point at which the kernel is centered and any other point in the dataset, and that the value of the kernel is a measure of the influence the points have on each other. The plot clearly shows that for a given pair of points (i.e., fixed r), the influence they have on each other decreases with increasing gamma.

9. Time to practice!

In the next lesson, we'll use this kernel to build models, but first let's do some exercises.