How linear regression works
1. How linear regression works
Let's see how linear regression works. To keep things understandable, we'll stick to simple linear regression, with a single numeric explanatory variable.
2. The standard simple linear regression plot
Here is the standard scatter plot of a simple linear regression.
3. Visualizing residuals
Here's the same plot, showing the residuals. That is, each residual is the actual response minus the predicted response. For the best fit, we want those red lines to be as short as possible. That is, we want a metric that measures the size of all the residuals, and we want to make that as small as possible.
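As an illustration, a plot like this could be drawn with a sketch along these lines; the data and column names here are made up for the example, not taken from the lesson.

import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.formula.api import ols

# Made-up example data; the lesson's own dataset would work the same way.
df = pd.DataFrame({"x": [1, 2, 3, 4, 5],
                   "y": [2.1, 3.9, 6.2, 7.8, 10.1]})

# Fit a simple linear regression and get the predicted responses.
model = ols("y ~ x", data=df).fit()
predicted = model.fittedvalues

fig, ax = plt.subplots()
ax.scatter(df["x"], df["y"])                 # actual responses
ax.plot(df["x"], predicted, color="black")   # fitted regression line
# Each red segment is a residual: actual minus predicted.
ax.vlines(df["x"], predicted, df["y"], color="red")
plt.show()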
4. A metric for the best fit
The simplest idea for that metric would be to add up all the residuals. This doesn't work because some residuals are negative, so they would make the total smaller instead of larger. Instead, we do the next easiest thing, which is to square each residual so that they are non-negative, and then add them up. This metric is called the sum of squares. The tricky part is determining which intercept and slope coefficients will result in the smallest sum of squares.
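In code, the metric for one candidate line could look like this; it's a sketch where the data and the candidate coefficients are placeholders.

import numpy as np

# Placeholder data and one candidate pair of coefficients.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
intercept, slope = 0.0, 2.0

predicted = intercept + slope * x
residuals = y - predicted
sum_of_squares = np.sum(residuals ** 2)  # smaller means a better fit
print(sum_of_squares)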
5. A detour into numerical optimization
To solve this problem, we need to take a detour into numerical optimization, which means finding the minimum point of a function. Consider this quadratic equation. The y-value is x-squared minus x plus ten. The plot shows that the minimum point of the function occurs when x is a little above zero and y is a little below ten, but how can we find it exactly?
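In symbols, that quadratic is:

\[ y = x^2 - x + 10 \]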
6. Using calculus to solve the equation
It's possible to solve this with calculus. Don't worry if this doesn't make sense; you won't need it for the exercises. You find the minimum by taking the derivative, setting that derivative to zero, rearranging for x, then substituting back into the original equation to find y. It gets the right answer: x is zero-point-five and y is nine-point-seven-five. However, not all equations can be solved in this analytic fashion, and ever since Newton and Leibniz invented calculus, mathematicians have been trying to find ways of avoiding it. In fact, one of the perks of being a data scientist is that you can just let Python figure out how to find the minimum.
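Written out, those calculus steps are:

\[
\frac{dy}{dx} = 2x - 1 = 0
\quad\Rightarrow\quad
x = \tfrac{1}{2}, \qquad
y = \left(\tfrac{1}{2}\right)^2 - \tfrac{1}{2} + 10 = 9.75
\]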
7. minimize()
To perform numerical optimization in Python, you can use the optimize package from scipy. In this case, you'll need the minimize function. The function you want to minimize is defined here. It takes x as an input, and returns y as x-squared minus x plus ten. Then you call minimize. The first argument is the function to call, without parentheses. The second argument is an initial guess at the answer. For more complicated functions, this is sometimes important, but here you could pick anything. In the output, you see fun, the estimated y-value of the function, which is spot on. The x-value is at the bottom. It's close to the correct answer of zero-point-five, and if you need better accuracy, there are many options you can play with to improve the answer. The other pieces of output are diagnostic values, which we don't need here.
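As a rough sketch of that call, the code looks something like this; the function name and the initial guess are my own choices, not necessarily those on the slide.

from scipy.optimize import minimize

# The function to minimize: y equals x-squared minus x plus ten.
def calc_quadratic(x):
    y = x ** 2 - x + 10
    return y

# First argument: the function itself, without parentheses.
# x0: an initial guess at where the minimum lies.
result = minimize(fun=calc_quadratic, x0=3)

print(result.fun)  # estimated minimum y-value, close to 9.75
print(result.x)    # estimated x-value, close to 0.5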
8. A linear regression algorithm
While ols() is hundreds of lines of code, you can implement simple linear regression for a specific dataset in just a few lines. You define a function that accepts the intercept and slope, and returns the sum of the squares of residuals. You'll have to use the trick of giving the function a single coeffs argument, then extracting the individual intercept and slope. You'll perform the rest of the calculation yourself in the exercises. Then you call minimize, passing your sum of squares function and an initial guess for the coefficients. That's it!
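Roughly, the pieces fit together like this. This is a sketch with made-up data, and the body of the sum-of-squares function is the part you'll fill in during the exercises.

import numpy as np
from scipy.optimize import minimize

# Hypothetical data; in the exercises you'll use the course dataset instead.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# A single coeffs argument, so minimize() can vary both values at once.
def calc_sum_of_squares(coeffs):
    intercept, slope = coeffs
    predicted = intercept + slope * x
    residuals = y - predicted
    return np.sum(residuals ** 2)

# Initial guess for [intercept, slope]; here almost any guess will do.
result = minimize(fun=calc_sum_of_squares, x0=[0, 0])
print(result.x)  # estimated [intercept, slope], close to what ols() gives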
9. Let's practice!
Time to delve into linear regression's internals!