
Model selection and assessment

1. Model assessment and selection

Let's recap what we've learned so far. After covering background modeling theory and terminology in Chapter 1, in Chapter 2 you modeled basic regressions using one explanatory/predictor X variable, and in Chapter 3 you extended this to two X variables. You created many models for both teaching score and house price. However, you may be asking: how do you know which model to choose? In other words, which model is best? What do we mean by "best," and how does one assess this? In this final chapter, you'll answer these questions via elementary model assessment and selection. In particular, you'll assess the quality of the multiple regression models for Seattle house prices from Chapter 3. But first, a brief refresher.

2. Refresher: Multiple regression

In Chapter 3 you studied two different multiple regression models for the outcome variable log10_price. The first used two numerical explanatory/predictor X variables: log10_size and year. The other used one numerical and one categorical X variable: log10_size and condition. If you wanted to explain or predict house prices and had to choose between these two models, which one would you select? Presumably the "better" one. As suggested earlier, this necessitates an explicit criterion for "better." Have you seen one so far? Yes: the sum of squared residuals!

3. Refresher: Sum of squared residuals

Recall that a residual is an observed value y minus its corresponding fitted/predicted value y-hat; in our case, log10_price minus log10_price_hat. Visually, residuals are the vertical distances between the blue points and their corresponding fitted values on the regression plane; I've marked a small selection on the snapshot of the 3D visualization. Furthermore, you learned that of all possible planes, the regression plane minimizes the sum of squared residuals. The latter is computed by squaring all 21k residuals and summing them. You saw that this quantity can be thought of as a measure of lack of fit, where larger values indicate a worse fit.
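To make this concrete, here is a minimal sketch of the computation in Python with numpy, using a tiny made-up dataset (not the actual Seattle house data) with two numerical predictors standing in for log10_size and year. We fit the least-squares plane, form the residuals y minus y-hat, then square and sum them:

```python
import numpy as np

# Hypothetical toy data: two predictors standing in for
# log10_size and year; outcome stands in for log10_price.
X = np.array([[3.0, 1990],
              [3.2, 2001],
              [3.5, 2010],
              [3.1, 1975],
              [3.4, 2005]], dtype=float)
y = np.array([5.5, 5.8, 6.1, 5.4, 6.0])

# Design matrix: an intercept column plus the two predictors.
A = np.column_stack([np.ones(len(X)), X])

# Least-squares fit: of all planes, this minimizes the SSR.
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# Residuals: observed y minus fitted/predicted y-hat.
y_hat = A @ coef
residuals = y - y_hat

# Sum of squared residuals: square each residual, then sum.
ssr = np.sum(residuals ** 2)
print(ssr)
```

With the real data the same recipe runs over all 21k residuals; larger SSR values indicate a worse fit.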

4. Refresher: Sum of squared residuals

You computed this value explicitly in a previous video for model_price_1, which uses log10_size and year as X variables. You saw that this model's sum of squared residuals was 585, a number that's hard to make sense of on its own.

5. Refresher: Sum of squared residuals

However, let's compute the sum of squared residuals for model_price_3 as well, which uses the categorical variable condition instead of the numerical variable year. The sum of squared residuals is now 608. It seems that model 3, which uses condition, has a greater lack of fit and is therefore "worse," suggesting that model 1, which uses year, is better.
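This comparison can be sketched in Python on a small made-up dataset (again, not the real Seattle data or the course's R objects): fit one model with two numerical predictors, fit a second that swaps the numerical year for dummy-coded condition levels, and compare their sums of squared residuals:

```python
import numpy as np

# Hypothetical toy data mirroring the chapter's setup.
size = np.array([3.0, 3.2, 3.5, 3.1, 3.4, 3.3])
year = np.array([1990, 2001, 2010, 1975, 2005, 1998], dtype=float)
cond = np.array([1, 2, 3, 1, 3, 2])   # categorical: 3 condition levels
y = np.array([5.5, 5.8, 6.1, 5.4, 6.0, 5.7])

def ssr(A, y):
    """Fit by least squares, return the sum of squared residuals."""
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ coef
    return float((r ** 2).sum())

ones = np.ones_like(y)
# Model 1 analogue: intercept + two numerical predictors.
A1 = np.column_stack([ones, size, year])
# Model 3 analogue: intercept + size + dummy variables for condition
# (level 1 is the baseline, absorbed by the intercept).
A3 = np.column_stack([ones, size, cond == 2, cond == 3])

ssr_1, ssr_3 = ssr(A1, y), ssr(A3, y)
# Decision rule: prefer the model with the smaller lack of fit.
better = "model 1" if ssr_1 < ssr_3 else "model 3"
print(ssr_1, ssr_3, better)
```

On the actual house data this rule picks model 1, since 585 is less than 608.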

6. Let's practice!

Your turn. Let's compute the sum of squared residuals like you did in Chapter 3, but now with an eye towards model assessment and selection.