
How linear regression works

1. How linear regression works

Let's see how linear regression works. To keep things understandable, we'll stick to simple linear regression, with a single numeric explanatory variable.

2. The standard simple linear regression plot

Here is the standard scatter plot of a simple linear regression.
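To give a rough idea of how such a plot might be produced (this is a sketch only, using simulated data rather than the course's actual dataset), you could draw the points and the fitted line with ggplot2:

# Simulated example data; the course's real dataset is not shown in this transcript
set.seed(1)
example_data <- data.frame(x = runif(50, 0, 10))
example_data$y <- 2 + 3 * example_data$x + rnorm(50, sd = 2)

library(ggplot2)
ggplot(example_data, aes(x, y)) +
  geom_point() +                           # the scatter of actual responses
  geom_smooth(method = "lm", se = FALSE)   # the fitted simple linear regression line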

3. Visualizing residuals

Here's the same plot, showing the residuals: that is, the actual response minus the predicted response. For the best fit, we want those residual lines to be as short as possible. That is, we want a metric that measures the size of all the residuals, and we want to make that metric as small as possible.
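As a sketch of that calculation (again using the simulated example_data rather than the course dataset), the residuals are just the actual responses minus the predictions from a fitted model:

# Simulated stand-in for the course dataset
set.seed(1)
example_data <- data.frame(x = runif(50, 0, 10))
example_data$y <- 2 + 3 * example_data$x + rnorm(50, sd = 2)

mdl <- lm(y ~ x, data = example_data)
predicted <- predict(mdl)                # predicted responses
resids <- example_data$y - predicted     # actual minus predicted
# residuals(mdl) returns the same values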

4. A metric for the best fit

The simplest idea for that metric would be to add up all the residuals. This doesn't work because some residuals are negative, so they would make the total smaller instead of larger. Instead, we do the next easiest thing, which is to square each residual so that it is non-negative, and then add them up. This metric is called the sum of squares. The tricky part is determining which intercept and slope coefficients will result in the smallest sum of squares.
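In code, the metric is simply the residuals squared and summed. Reusing mdl and example_data from the previous sketch:

resids <- example_data$y - predict(mdl)   # actual minus predicted
sum_of_squares <- sum(resids ^ 2)         # squaring makes every term non-negative
sum_of_squares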

5. A detour into numerical optimization

To solve this problem, we need to take a detour into numerical optimization, which means finding the minimum point of a function. Consider this quadratic equation. The y-value is x-squared minus x plus ten. The plot shows that the minimum point of the function occurs when x is a little above zero and y is a little below ten, but how can we find it exactly?
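Written out, the quadratic is y = x^2 - x + 10. A minimal sketch to define it as an R function and plot it over a range of x-values:

# The quadratic from the slide: y = x^2 - x + 10
calc_quadratic <- function(x) {
  x ^ 2 - x + 10
}

# Plotting over a range of x shows the minimum a little above x = 0, a little below y = 10
curve(calc_quadratic, from = -2, to = 3)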

6. Using calculus to solve the equation

I can solve this with calculus by taking the derivative, setting that derivative to zero, rearranging for x, then substituting back into the original equation to find y. It gets the right answer. x is zero-point-five and y is nine-point-seven-five. However, not all equations can be solved in this analytic fashion, and ever since Newton and Leibniz invented calculus, mathematicians have been trying to find ways of avoiding it. In fact, one of the perks of being a data scientist is that you can just let R figure out how to find the minimum.
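Spelled out, those calculus steps are:

y = x^2 - x + 10
dy/dx = 2x - 1
Setting 2x - 1 = 0 gives x = 0.5
Substituting back: y = 0.5^2 - 0.5 + 10 = 9.75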

7. optim()

To perform numerical optimization in R, you need a function to minimize. In this case, it takes x as an input, and returns x-squared minus x plus ten. Then you call optim. The first argument is an initial guess at the answer. For more complicated functions, this is sometimes important, but here you could pick anything. The second argument is the function to call, without parentheses. In the output, par is the estimate for the x-value. It's close to the correct answer of zero-point-five, and if you need better accuracy, there are many options you can play with to improve the answer. value is the estimated y-value, which is spot on. The other pieces of output are diagnostic values, which we don't need here.
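The exact code isn't shown in this transcript, but the call described would look roughly like this:

# A function to minimize: takes x as input, returns x^2 - x + 10
calc_quadratic <- function(x) {
  x ^ 2 - x + 10
}

# First argument: an initial guess at the answer (any number works for this simple function)
# Second argument: the function to call, without parentheses
optim(par = 3, fn = calc_quadratic)
# (optim may warn that Nelder-Mead is unreliable in one dimension;
#  it still returns par close to 0.5 and value close to 9.75)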

8. Slight refinements

There are two changes to the code which aren't needed here but will assist us in the linear regression case. The function passed to optim is only allowed to have one argument, so to optimize for multiple variables, you need to pass them as a numeric vector. Here, calc_quadratic takes a numeric vector called coeffs, then extracts the first element and calls it x. Secondly, passing a named vector as the initial guess to the par argument of optim makes the output easier to read. Now the par element in the output is named "x".
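A sketch of those two refinements, following the description above:

# calc_quadratic now takes a single numeric vector of coefficients
calc_quadratic <- function(coeffs) {
  x <- coeffs[1]        # extract the first element and call it x
  x ^ 2 - x + 10
}

# A named vector as the initial guess makes the par element of the output named "x"
optim(par = c(x = 3), fn = calc_quadratic)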

9. A linear regression algorithm

While lm() is hundreds of lines of code, you can implement simple linear regression for a specific dataset in just a few lines. You define a function that accepts the intercept and slope, and returns the sum of the squares of residuals. You'll have to use the trick of giving the function a single coeffs argument, then extracting the individual numbers. You'll perform the rest of the calculation yourself in the exercises. Then you call optim, passing an initial guess for the coefficients and your sum of squares function. That's it!
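The course exercises have you fill in that calculation yourself, so what follows is only a hedged sketch of what such a function might look like, with hypothetical names and simulated data standing in for the course dataset:

# Simulated stand-in for the course dataset
set.seed(1)
example_data <- data.frame(x = runif(50, 0, 10))
example_data$y <- 2 + 3 * example_data$x + rnorm(50, sd = 2)

# A single coeffs argument, with the intercept and slope extracted inside
calc_sum_of_squares <- function(coeffs) {
  intercept <- coeffs[1]
  slope <- coeffs[2]
  predicted <- intercept + slope * example_data$x   # predicted responses
  resids <- example_data$y - predicted              # actual minus predicted
  sum(resids ^ 2)                                   # the metric to minimize
}

# An initial guess for the coefficients, plus the sum of squares function
optim(par = c(intercept = 0, slope = 0), fn = calc_sum_of_squares)

# For comparison, lm() finds essentially the same coefficients
coef(lm(y ~ x, data = example_data))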

10. Let's practice!

Time to delve into linear regression's internals!