1. What is a loss function?
In this video, we'll kick off our discussion of loss functions.
Many machine learning algorithms involve minimizing a loss, and by understanding this perspective you'll be equipped with the tools to see connections between models, quickly grasp new ones, and start tailoring them to your data science problem.
2. Least squares: the squared loss
We have actually seen loss functions before, in the prerequisite course on supervised learning.
For example, least squares linear regression, such as scikit-learn's LinearRegression class, minimizes the sum of the squared errors made on your training set. Here, the error on an example is defined as the difference between the true target value and the predicted target value.
You can think of minimizing the loss as jiggling around the coefficients, or parameters, of the model until this total error, or loss function, is as small as possible. In other words, the loss function is a penalty score that tells us how well (or, to be precise, how poorly) the model is doing on the training data. We can think of the "fit" function as running code that minimizes the loss.
Note that the score function provided by scikit-learn isn't necessarily the same thing as the loss function. The loss is used to fit the model on the data, and the score is used to see how well we're doing. It's intuitive that these would be the same, and they often are, but you should be aware that this isn't guaranteed.
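To make that distinction concrete, here is a minimal sketch with made-up data; the numbers are purely illustrative, and the point is only that the quantity fit minimizes (the sum of squared errors) is not the same number that score reports (R-squared by default).

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Tiny made-up regression data set
    X = np.array([[1.0], [2.0], [3.0], [4.0]])
    y = np.array([1.0, 3.0, 2.0, 5.0])

    model = LinearRegression().fit(X, y)

    errors = y - model.predict(X)
    print(np.sum(errors ** 2))   # the loss that fit() minimizes: sum of squared errors
    print(model.score(X, y))     # the default score: R-squared, a different quantity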
3. Classification errors: the 0-1 loss
The squared error from LinearRegression is not appropriate for classification problems, because our y-values are categories, not numbers.
For classification, a natural quantity to think about is the number of errors we've made. Since we'd like to make this as small as possible, the number of errors might be a good loss function.
We'll refer to this loss function as the 0-1 loss, because it's defined to be either 0 (if your prediction is correct) or 1 (if your prediction is wrong). By summing this function over all training examples, we get the number of mistakes we've made on the training set, since we add 1 to the total for each mistake.
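As a quick sketch with made-up labels (the arrays here are purely illustrative), the 0-1 loss can be computed by comparing true and predicted labels and summing the mismatches.

    import numpy as np

    y_true = np.array([1, -1, 1, 1, -1])   # made-up true labels
    y_pred = np.array([1, 1, 1, -1, -1])   # made-up predictions

    zero_one = (y_true != y_pred).astype(int)
    print(zero_one)        # [0 1 0 1 0]: 1 for each mistake, 0 otherwise
    print(zero_one.sum())  # 2: the number of errors on this training set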
While the 0-1 loss is important for our conceptual journey, it turns out to be very hard to minimize directly in practice, which is why logistic regression and SVMs don't use it. The reasons for this are beyond the scope of the course.
4. Minimizing a loss
In the exercises you'll try minimizing a loss function using scipy-dot-optimize-dot-minimize, a function from the SciPy package that can minimize all sorts of functions. Let's try it out. Here, I'll minimize the function y=x^2, which is computed using numpy.square. The second argument is our initial guess. Let's try zero. Finally, I have "dot x" at the end to grab the input value that makes the function as small as possible.
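In code, the call looks roughly like this (assuming NumPy is imported as np; the printed result may be formatted slightly differently depending on your SciPy version).

    import numpy as np
    from scipy.optimize import minimize

    # Minimize y = x^2: np.square is the function to minimize, 0 is the
    # initial guess, and .x pulls out the minimizing input value
    minimize(np.square, 0).x
    # array([0.])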
We got zero as a result because this function is minimized when x=0. But that's not too interesting, since our initial guess was already the correct answer! It's correct because a squared number can only be zero or more, so the smallest possible value is attained when x=0.
Let's try another initial guess to see if it's actually doing something.
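For example, with 2 as the new initial guess (the narration doesn't say which value is used on the slide, so 2 is just an illustration):

    minimize(np.square, 2).x
    # something like array([-1.9e-08]): a tiny number, not exactly zero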
What we see is a very small number, near 10 to the power of -8. This is normal for numerical optimization: we don't expect exactly the right answer, but something very close.
In the exercises, you'll minimize the squared error from linear regression. The inputs will be the model coefficients. So, you can think of the code as answering the question, "what values of the model coefficients make my squared error as small as possible?" That's exactly what linear regression is doing.
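Here is a rough sketch of that idea with made-up data (the exercise code itself will differ): we write the sum of squared errors as a function of the coefficients, hand it to minimize, and compare against LinearRegression.

    import numpy as np
    from scipy.optimize import minimize
    from sklearn.linear_model import LinearRegression

    # Made-up data: one feature, four examples
    X = np.array([[1.0], [2.0], [3.0], [4.0]])
    y = np.array([2.0, 4.1, 5.9, 8.2])

    def squared_error(w):
        # w[0] is the slope, w[1] is the intercept
        predictions = X[:, 0] * w[0] + w[1]
        return np.sum((y - predictions) ** 2)

    w_fit = minimize(squared_error, x0=[0.0, 0.0]).x
    lr = LinearRegression().fit(X, y)

    print(w_fit)                    # coefficients found by minimizing the loss directly
    print(lr.coef_, lr.intercept_)  # essentially the same values from scikit-learn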
5. Let's practice!
Time for a couple of exercises on loss functions.