Improve the fit of your models
1. Improve the fit of your models
Using the information we gathered with augment() and glance(), we learned that some of the simple linear regression models do not adequately fit the underlying trends in our data. To overcome this, we will now employ a multiple regression model.

2. Multiple Linear Regression model
This model is a natural extension of the simple linear regression model. The key difference is that more than one explanatory variable is used to explain the outcome, meaning that rather than fitting a best-fit line we are instead fitting a multi-dimensional plane. In the gapminder dataset we can use additional characteristics, or features, of our observations to model life expectancy. So, let's use them all.

3. Using all features
The choice of which features to use is controlled in the formula field of the lm() function. Remember that for a simple model you used the formula of life expectancy as explained by year. Similarly, for a multiple linear regression model you can explicitly define the formula by including the name of each feature separated by a plus sign, or, if you want to include all features, you can capture them with a period, as shown here.

4. Using broom with Multiple Linear Regression models
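The formula variations just described can be sketched as follows. This is a minimal sketch assuming the gapminder package's gapminder data frame is available, with lifeExp as the outcome; the specific features named are illustrative.

```r
library(gapminder)  # provides the gapminder data frame

# Simple model: life expectancy explained by year alone
simple_fit <- lm(lifeExp ~ year, data = gapminder)

# Multiple regression: name each extra feature, joined by plus signs
multiple_fit <- lm(lifeExp ~ year + pop + gdpPercap, data = gapminder)

# Or use a period to include every remaining column as a feature
full_fit <- lm(lifeExp ~ ., data = gapminder)
```

The period form is convenient, but naming features explicitly makes it clearer which variables the model depends on.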
The behavior of the broom functions remains the same. tidy() returns the coefficient estimates of the model, which now include estimates for the four additional features. The same goes for augment(): in addition to the fitted values for each observation, the values of the four new features are returned. And although the expected output of glance() remains the same, we have to shift our focus from the R-squared value to the adjusted R-squared value when evaluating the fit of our models or comparing simple and multiple linear regression models.

5. Adjusted R-squared
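A minimal sketch of the three broom verbs applied to a multiple regression fit, assuming the broom and gapminder packages are installed (the features named here are illustrative):

```r
library(broom)
library(gapminder)

# Fit a multiple regression model with several explanatory features
fit <- lm(lifeExp ~ year + pop + gdpPercap + continent, data = gapminder)

tidy(fit)     # one row per coefficient estimate
augment(fit)  # per-observation fitted values plus the feature columns
glance(fit)   # one-row model summary, incl. r.squared and adj.r.squared
```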
Remember that R-squared measures the variation explained by the model. Adding any new feature to a model, regardless of its relationship with the dependent variable, will never decrease the model's R-squared value. This becomes problematic when comparing the fit of models that use different numbers of explanatory features. To compensate for this, you will instead use the adjusted R-squared value, a modified R-squared metric whose calculation takes into account the number of features used in the model. The interpretation of the adjusted R-squared value is very similar to that of R-squared, and you will use it to evaluate the fit of your new models and compare them to the previously built simple linear models.

6. Let's practice!
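As a warm-up for the exercises, the adjusted R-squared comparison described above can be sketched like this; the feature choice for the multiple model is illustrative, and the glance() column is named adj.r.squared:

```r
library(broom)
library(gapminder)

simple_fit   <- lm(lifeExp ~ year, data = gapminder)
multiple_fit <- lm(lifeExp ~ year + pop + gdpPercap, data = gapminder)

# adj.r.squared penalizes extra features, so it is the fair basis
# for comparing models with different numbers of features
glance(simple_fit)$adj.r.squared
glance(multiple_fit)$adj.r.squared
```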
So, let's get started.