1. The basics of linear regression
So, how does linear regression work?
2. Regression mechanics
We want to fit a line to the data, and in two dimensions this takes the form of y equals ax plus b. Using a single feature is known as simple linear regression, where y is the target, x is the feature, and a and b are the model parameters that we want to learn. a and b are also called the model coefficients, or the slope and intercept, respectively.
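To make the mechanics concrete, here is a minimal sketch with made-up numbers showing how a candidate line with slope a and intercept b maps feature values to predictions:

```python
import numpy as np

# Illustrative values only: a handful of feature values and a candidate line
x = np.array([1.0, 2.0, 3.0, 4.0])   # feature
a, b = 2.5, 1.0                      # slope and intercept (chosen, not learned)

y_pred = a * x + b                   # the line's prediction for each x
print(y_pred)                        # [ 3.5  6.   8.5 11. ]
```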
So how do we accurately choose values for a and b?
We can define an error function for any given line and then choose the line that minimizes this function. Error functions are also called loss or cost functions.
3. The loss function
Let's visualize a loss function using this scatter plot. We want the line to be as close to the observations as possible. Therefore, we want to minimize the vertical distance between the fit and the data. So for each observation, we calculate the vertical distance between it and the line.
This distance is called a residual.
We could try to minimize the sum of the residuals, but then each positive residual would cancel out each negative residual.
8. Ordinary Least Squares
To avoid this, we square the residuals. By adding all the squared residuals, we calculate the residual sum of squares, or RSS.
This type of linear regression is called Ordinary Least Squares, or OLS, where we aim to minimize the RSS.
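As a rough sketch of that calculation, using made-up observations and an arbitrary candidate line, the RSS can be computed directly:

```python
import numpy as np

# Made-up observations and an arbitrary candidate line, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 7.0, 8.0, 12.0])
a, b = 2.5, 1.0

residuals = y - (a * x + b)      # vertical distance between each point and the line
rss = np.sum(residuals ** 2)     # residual sum of squares
print(rss)                       # OLS picks the a and b that minimize this quantity
```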
9. Linear regression in higher dimensions
When we have two features, x1 and x2, and one target, y, the model takes the form y = a1x1 + a2x2 + b.
So to fit a linear regression model we must learn three parameters: the coefficients a1 and a2, and the intercept, b.
Using more features is known as multiple linear regression. Fitting a multiple linear regression model means learning one coefficient for each of the n features (a1 through an), plus the intercept, b.
When fitting a multiple linear regression model, scikit-learn expects one variable holding all the feature values, X, and one variable holding the target values, y.
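For instance, assuming the data sits in a pandas DataFrame with a glucose column as the target (the column names and values below are made up for illustration), X and y would be built like this:

```python
import pandas as pd

# Tiny made-up stand-in for the diabetes data, for illustration only
diabetes_df = pd.DataFrame({
    "bmi": [22.1, 30.5, 27.3],
    "age": [45, 61, 38],
    "glucose": [99, 155, 110],
})

X = diabetes_df.drop("glucose", axis=1).values   # one array holding all features
y = diabetes_df["glucose"].values                # one array holding the target

print(X.shape, y.shape)   # (3, 2) and (3,)
```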
10. Linear regression using all features
Let's perform linear regression to predict blood glucose levels using all of the features from the diabetes dataset.
We import LinearRegression from sklearn-dot-linear_model.
Then we split the data into training and test sets, instantiate the model, fit it on the training set, and predict on the test set.
Note that linear regression in scikit-learn performs OLS under the hood.
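Put together, that workflow looks roughly like the sketch below; the split proportion and random seed are arbitrary choices, and X and y are assumed to already hold the diabetes features and blood glucose target, as in the earlier sketch:

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# X and y are assumed to already hold the feature and target arrays
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

reg = LinearRegression()        # performs OLS under the hood
reg.fit(X_train, y_train)       # learn one coefficient per feature plus the intercept
y_pred = reg.predict(X_test)    # predictions for the held-out test set
```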
11. R-squared
The default metric for linear regression is R-squared, which quantifies the amount of variance in the target variable that is explained by the features.
Values can range from zero to one, with one meaning the features completely explain the target's variance.
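Concretely, R-squared compares the residual sum of squares to the total variation in the target: R-squared equals one minus RSS divided by the total sum of squares, where the total sum of squares adds up the squared differences between each observed target value and the target's mean. A perfect fit has an RSS of zero and therefore an R-squared of one.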
Here are two plots visualizing high and low R-squared respectively.
12. R-squared in scikit-learn
To compute R-squared, we call the model's dot-score method, passing the test features and targets. Here the features only explain about 35 percent of blood glucose level variance.
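As a sketch, reusing the fitted model and test split from the earlier sketch:

```python
# reg, X_test, and y_test are assumed from the earlier fitting sketch
r_squared = reg.score(X_test, y_test)   # R-squared on the test set
print(r_squared)                        # roughly 0.35 in the example described here
```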
13. Mean squared error and root mean squared error
Another way to assess a regression model's performance is to divide the residual sum of squares by the number of observations, giving the mean of the squared residuals. This is known as the mean squared error, or MSE.
MSE is measured in units of our target variable, squared. For example, if a model is predicting a dollar value, MSE will be in dollars squared.
To convert to dollars, we can take the square root, known as the root mean squared error, or RMSE.
14. RMSE in scikit-learn
To calculate RMSE, we import mean_squared_error from sklearn-dot-metrics, then call mean_squared_error. We pass y_test and y_pred, and set squared equal to False, which returns the square root of the MSE. The model has an average error for blood glucose levels of around 24 milligrams per deciliter.
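A sketch of that call, again reusing y_test and y_pred from the earlier fitting sketch:

```python
from sklearn.metrics import mean_squared_error

# y_test and y_pred are assumed from the earlier fitting sketch
rmse = mean_squared_error(y_test, y_pred, squared=False)   # square root of the MSE
print(rmse)   # around 24 mg/dL in the example described here
```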
15. Let's practice!
Now let's build and evaluate a multiple linear regression model!