1. Support Vectors
Welcome to the final chapter of the course, where we'll discuss SVMs in more detail.
In this first video we'll discuss what support vectors are and why they matter.
2. What is an SVM?
In the last chapter we talked about logistic regression, which is a linear classifier learned with the logistic loss function.
Linear SVMs are also linear classifiers, but they use the hinge loss instead.
The standard definition of an SVM also includes L2 regularization.
Remember these diagrams from Chapter 2? The logistic and hinge losses look fairly similar. A key difference is the "flat" part of the hinge loss, which occurs when the raw model output is greater than 1, meaning you predicted an example correctly by more than a certain margin. If a training example falls in this "zero loss" region, it doesn't contribute to the fit; if I removed that example, nothing would change. This is a key property of SVMs.
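To make that flat region concrete, here is a minimal sketch, not from the course slides, that evaluates both losses on a range of raw model outputs (the NumPy code and the sign convention are illustrative assumptions):

```python
import numpy as np

# "Raw model output" here means y * (w . x + b): positive when the example is
# classified correctly, and greater than 1 when it is correct with room to spare.
# (This convention is an assumption for illustration.)
raw = np.linspace(-2, 3, 11)

logistic_loss = np.log(1 + np.exp(-raw))   # never exactly zero
hinge_loss = np.maximum(0, 1 - raw)        # exactly zero once raw >= 1

for r, log_l, hinge_l in zip(raw, logistic_loss, hinge_loss):
    print(f"raw={r:+.1f}  logistic={log_l:.3f}  hinge={hinge_l:.3f}")
```

The logistic loss keeps shrinking but never reaches zero, while the hinge loss is exactly zero for every raw output of 1 or more.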
3. What are support vectors?
Support vectors are defined as examples that are NOT in the flat part of the loss diagram.
In the figure, support vectors are shown with yellow circles around them. Another way of defining support vectors is that they include incorrectly classified examples, as well as correctly classified examples that are close to the boundary. If you're wondering how close is considered close enough, this is controlled by the regularization strength.
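As a rough illustration of that last point, here is a sketch that counts support vectors for a few regularization settings, assuming scikit-learn's SVC and a made-up dataset. The exact counts will vary, but stronger regularization typically widens the "close to the boundary" zone and so yields more support vectors:

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Toy dataset (an illustrative assumption, not the course's data)
X, y = make_classification(n_samples=200, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, random_state=0)

# Smaller C means stronger regularization in scikit-learn,
# which typically produces more support vectors.
for C in [0.01, 1.0, 100.0]:
    svm = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C}: {len(svm.support_)} support vectors")
```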
Support vectors are the examples that matter to your fit. If an example is not a support vector, removing it has no effect on the model, because its loss was already zero.
Even though we give the support vectors a special name, it's really the non-support vectors that are remarkable: they have no influence on the fit at all. Logistic regression, by comparison, has no flat part in its loss, so every data point matters to the fit.
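You can check this property directly. The sketch below, again assuming scikit-learn and a toy dataset of my own choosing, refits the model using only the support vectors and compares the learned coefficients:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Toy two-class dataset (an illustrative assumption)
X, y = make_blobs(n_samples=100, centers=2, random_state=42)

svm = SVC(kernel="linear").fit(X, y)

# Refit using ONLY the support vectors
X_small, y_small = X[svm.support_], y[svm.support_]
svm_small = SVC(kernel="linear").fit(X_small, y_small)

# The two decision boundaries should be (essentially) identical
print(svm.coef_, svm.intercept_)
print(svm_small.coef_, svm_small.intercept_)
```

If the two printed coefficient vectors match up to numerical tolerance, it confirms that the discarded non-support vectors were not influencing the fit.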
Critical to the popularity of SVMs is that kernel SVMs, coming later in this chapter, are surprisingly fast to fit and predict. Part of the speed comes from clever algorithms whose running time only scales with the number of support vectors, rather than the total number of training examples.
4. Max-margin viewpoint
Although it's not the perspective we've taken in this course, you may encounter the idea that SVMs "maximize the margin". I want to briefly mention this viewpoint for completeness.
The diagram shows an SVM fit on a linearly separable dataset. As you can see, the learned boundary falls halfway between the two classes. This is an appealing property: in the absence of other information, this boundary makes more sense than one that sits much closer to one class than the other.
5. Max-margin viewpoint
The yellow lines show the distances from the support vectors to the boundary. The length of the yellow lines, which is the same for each support vector, is called the margin. If the regularization strength is not too large, SVMs maximize the margin of linearly separable datasets.
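As a small sanity check, here is a sketch using scikit-learn's SVC on a tiny separable dataset invented for this purpose. With weak regularization, the distance from each support vector to the boundary should match the margin, which works out to 1 divided by the norm of the weight vector under the usual scaling:

```python
import numpy as np
from sklearn.svm import SVC

# A tiny, linearly separable dataset invented for this check
X = np.array([[1.0, 1.0], [2.0, 2.5], [3.0, 1.5],
              [6.0, 5.0], [7.0, 6.5], [8.0, 6.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# Weak regularization (large C), as in the narration
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
margin = 1.0 / np.linalg.norm(w)  # margin under the usual |w.x + b| = 1 scaling

# Distance from each support vector to the boundary should equal the margin
distances = np.abs(X[clf.support_] @ w + b) / np.linalg.norm(w)
print("margin:", margin)
print("support vector distances:", distances)
```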
Unfortunately, most datasets are not linearly separable; in other words, we don't typically expect a training accuracy of 100%. While these max-margin ideas can be extended to non-separable data, we won't pursue that avenue here. You can think of max margin as another view of what we've already defined SVMs to be, namely the hinge loss with L2 regularization. As it turns out, the two formulations are mathematically equivalent.
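For reference, here is the hinge-loss-plus-L2 objective written out in standard notation (the symbols below are common conventions, not taken from the course slides), where lambda sets the regularization strength and the labels y_i are coded as -1 and +1:

```latex
\min_{w,\,b}\;\; \lambda\,\lVert w\rVert_2^{2} \;+\; \sum_{i=1}^{n} \max\!\Bigl(0,\; 1 - y_i\bigl(w^{\top}x_i + b\bigr)\Bigr)
```

When the data are linearly separable and lambda is small, every hinge term can be driven to zero, and minimizing what remains amounts to making the norm of w as small as possible; since the margin works out to 1 over that norm, this is exactly the max-margin solution described on the slide.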
6. Let's practice!
Time to play with support vectors.