
The basics of linear regression

1. The basics of linear regression

So, how does linear regression work?

2. Regression mechanics

We want to fit a line to the data, and in two dimensions this takes the form of y equals ax plus b. Using a single feature is known as simple linear regression, where y is the target, x is the feature, and a and b are the model parameters that we want to learn. a and b are also called the model coefficients, or the slope and intercept, respectively. So how do we accurately choose values for a and b? We can define an error function for any given line and then choose the line that minimizes this function. Error functions are also called loss or cost functions.
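
To make this concrete, here is a minimal sketch of simple linear regression using scikit-learn's LinearRegression; the feature and target arrays are made up for illustration, not taken from the course data.

```python
# Minimal sketch: simple linear regression, y = ax + b, with one feature.
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative data; scikit-learn expects a 2D array of feature values.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

reg = LinearRegression()
reg.fit(X, y)

print(reg.coef_[0])    # learned slope, a
print(reg.intercept_)  # learned intercept, b
```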

3. The loss function

Let's visualize a loss function using this scatter plot. We want the line to be as close to the

4. The loss function

observations as possible. Therefore, we want to minimize the vertical distance between the fit and the data. So for each observation,

5. The loss function

we calculate the vertical distance between it and the line.

6. The loss function

This distance is called a residual. We could try to minimize the sum of the residuals,

7. The loss function

but then each positive residual would cancel out

8. Ordinary Least Squares

each negative residual. To avoid this, we square the residuals. By adding all the squared residuals, we calculate the residual sum of squares, or RSS. This type of linear regression is called Ordinary Least Squares, or OLS, where we aim to minimize the RSS.
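
As a rough sketch of this idea, the snippet below computes the residuals and the residual sum of squares for a candidate line by hand; the data and the slope and intercept values are illustrative.

```python
# Sketch: residuals and the residual sum of squares (RSS) for a candidate line y = a*x + b.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # illustrative feature values
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])  # illustrative observations
a, b = 2.0, 0.1                          # candidate slope and intercept

residuals = y - (a * x + b)   # vertical distances between the data and the line
rss = np.sum(residuals ** 2)  # squaring stops positive and negative residuals cancelling

print(rss)  # OLS picks the a and b that minimize this quantity
```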

9. Linear regression in higher dimensions

When we have two features, x1 and x2, and one target, y, the model takes the form y = a1x1 + a2x2 + b. So to fit a linear regression model we specify three parameters: a1, a2, and the intercept, b. Using more than one feature is known as multiple linear regression. Fitting a multiple linear regression model means learning a coefficient for each of the n features, plus the intercept, b. For multiple linear regression, scikit-learn expects one variable containing the feature values and one containing the target values.
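
A small sketch of what this means in practice, with two features and made-up coefficient values:

```python
# Sketch: a multiple linear regression prediction, y = a1*x1 + a2*x2 + b,
# using illustrative coefficients for two features.
import numpy as np

X = np.array([[0.5, 1.2],
              [1.0, 0.7]])            # two observations, two features (x1, x2)
coefficients = np.array([3.0, -1.5])  # a1 and a2 (illustrative values)
intercept = 0.8                       # b

y_pred = X @ coefficients + intercept  # one coefficient per feature, plus the intercept
print(y_pred)
```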

10. Linear regression using all features

Let's perform linear regression to predict blood glucose levels using all of the features from the diabetes dataset. We import LinearRegression from sklearn-dot-linear_model. Then we split the data into training and test sets, instantiate the model, fit it on the training set, and predict on the test set. Note that linear regression in scikit-learn performs OLS under the hood.
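
The workflow looks roughly like the sketch below. scikit-learn's built-in diabetes dataset is used here as a stand-in for the course's data, and the variable names and split settings are assumptions.

```python
# Sketch of the workflow described above, using scikit-learn's built-in diabetes
# data as a stand-in for the course's dataset.
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)

# Split into training and test sets, instantiate the model, fit, and predict.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
reg = LinearRegression()  # performs OLS under the hood
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)
```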

11. R-squared

The default metric for linear regression is R-squared, which quantifies the proportion of the variance in the target variable that is explained by the features. Values range from zero to one, with one meaning the features completely explain the target's variance. Here are two plots visualizing high and low R-squared, respectively.
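
One way to see what R-squared measures is to compute it by hand as one minus the ratio of unexplained to total variation; the observed and predicted values below are purely illustrative.

```python
# Sketch: R-squared = 1 - RSS/TSS, the share of target variance explained by the model.
import numpy as np

y_true = np.array([100.0, 150.0, 120.0, 90.0])  # illustrative observed values
y_hat = np.array([110.0, 140.0, 125.0, 95.0])   # illustrative predicted values

rss = np.sum((y_true - y_hat) ** 2)            # unexplained variation (residual sum of squares)
tss = np.sum((y_true - np.mean(y_true)) ** 2)  # total variation around the mean
r2 = 1 - rss / tss
print(r2)
```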

12. R-squared in scikit-learn

To compute R-squared, we call the model's dot-score method, passing the test features and targets. Here the features only explain about 35 percent of blood glucose level variance.
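
Continuing the fitted model from the earlier workflow sketch (so reg, X_test, and y_test are assumed to already exist), the call would look like this:

```python
# R-squared on the test set via the model's score method
# (reg, X_test, and y_test come from the earlier workflow sketch).
r_squared = reg.score(X_test, y_test)
print(r_squared)
```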

13. Mean squared error and root mean squared error

Another way to assess a regression model's performance is to take the mean of the squared residuals. This is known as the mean squared error, or MSE. MSE is measured in the units of the target variable, squared. For example, if a model is predicting a dollar value, MSE will be in dollars squared. To convert back to the target's units, we can take the square root, known as the root mean squared error, or RMSE.
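
A small sketch of both quantities, computed by hand from illustrative observed and predicted values:

```python
# Sketch: MSE as the mean of the squared residuals, RMSE as its square root.
import numpy as np

y_true = np.array([100.0, 150.0, 120.0, 90.0])  # illustrative observed values
y_hat = np.array([110.0, 140.0, 125.0, 95.0])   # illustrative predicted values

mse = np.mean((y_true - y_hat) ** 2)  # in squared target units
rmse = np.sqrt(mse)                   # back in the target's own units
print(mse, rmse)
```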

14. RMSE in scikit-learn

To calculate RMSE, we import mean_squared_error from sklearn-dot-metrics, then call mean_squared_error. We pass y_test and y_pred, and set squared equal to False, which returns the square root of the MSE. The model has an average error for blood glucose levels of around 24 milligrams per deciliter.
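
In code, this would look roughly like the following, assuming y_test and y_pred from the earlier fit and a scikit-learn version where mean_squared_error still accepts the squared keyword:

```python
# RMSE via scikit-learn (y_test and y_pred come from the earlier workflow sketch).
from sklearn.metrics import mean_squared_error

rmse = mean_squared_error(y_test, y_pred, squared=False)  # squared=False returns the root of the MSE
print(rmse)
```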

15. Let's practice!

Now let's build and evaluate a multiple linear regression model!