Assessing model fit with R-squared

1. Assessing model fit with R-squared

Now that you've reviewed the sum of squared residuals with an eye towards model assessment and selection, let's learn about another measure of a model's fit: the widely known R-squared.

2. R-squared

R-squared is another numerical summary of how well a model fits a set of points. It is 1 minus the variance of the residuals over the variance of the outcome variable. If you've never heard of variance, it's another measure of variability or spread: the standard deviation squared. Instead of focusing on the formula, however, let's first focus on the intuition. While the sum of squared residuals is unbounded, meaning there is no theoretical upper limit to its value, R-squared is standardized to be between 0 and 1. Unlike the sum of squared residuals, where smaller values indicate better fit, larger values of R-squared indicate better fit. So 1 indicates perfect fit, and 0 indicates perfect lack of fit; in other words, no relationship between the outcome and explanatory/predictor variables. Let's explore these ideas visually.
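To make the formula concrete, here is a minimal Python sketch with made-up observed and fitted values (nothing here comes from the course data) that computes R-squared as 1 minus the variance of the residuals over the variance of the outcome:

    import numpy as np

    # Hypothetical observed outcomes and fitted values (made-up numbers)
    y = np.array([10.0, 12.0, 11.5, 14.0, 13.0])
    y_hat = np.array([10.5, 11.8, 11.9, 13.6, 13.2])

    residuals = y - y_hat

    # R-squared: 1 minus the variance of the residuals over the variance of y
    r_squared = 1 - np.var(residuals) / np.var(y)
    print(r_squared)  # between 0 and 1; closer to 1 means better fit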

3. High R-squared value example

Let's revisit basic regression with one numerical variable and consider a set of points with a perfectly linear relationship. In other words, the points fall perfectly on a line.

4. High R-squared value: "Perfect" fit

Recall that residuals are the vertical distances between the observed values, here the black points, and the corresponding fitted/predicted values on the blue regression line. Here, the residuals are all 0. Thus the variance of the residuals is 0, and R-squared is equal to 1 - 0, which is 1. Let's now consider an example where the R-squared is closer to 0, indicating a poorer fit.
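First, though, here is a quick sanity check of the perfect-fit case in code, again with made-up data:

    import numpy as np

    # Made-up points that fall exactly on a line
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = 2.0 + 3.0 * x  # perfectly linear, no scatter

    # Fit the least-squares line and compute the residuals
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (intercept + slope * x)

    # The residuals are all (essentially) zero, so their variance is 0 and R-squared is 1
    print(1 - np.var(residuals) / np.var(y))  # 1.0, up to floating-point rounding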

5. Low R-squared value example

Now the points don't fit tightly on a line, but rather exhibit a large amount of scatter. Let's add the best fitting regression line.

6. Low R-squared value example

Unlike the previous example, the residuals are now nonzero and vary from point to point, so the variance of the residuals, the numerator of the fraction, is greater than zero, and R-squared will be smaller than 1. Note that it is a mathematical fact that the variance of y is greater than or equal to the variance of the residuals, which guarantees that R-squared is between 0 and 1.
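A variation of the earlier sketch, with noise added to the made-up data, illustrates both points at once: the residual variance is now positive, but it is still no larger than the variance of y.

    import numpy as np

    rng = np.random.default_rng(42)

    # The same made-up line, now with a large amount of scatter added
    x = rng.uniform(0, 10, size=200)
    y = 2.0 + 3.0 * x + rng.normal(0, 20, size=200)

    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (intercept + slope * x)

    # For a least-squares fit, var(residuals) never exceeds var(y),
    # so R-squared lands between 0 and 1; here it is well below 1
    print(np.var(residuals) <= np.var(y))     # True
    print(1 - np.var(residuals) / np.var(y))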

7. Numerical interpretation

Using this fact, the numerical interpretation of R-squared is as follows: it is the proportion of the total variation in the outcome variable y that the model explains. Our models attempt to explain the variation in house prices. For example, what makes certain houses expensive and others not? The question is, how much of this variation can our models explain? If it's 100%, then our model explains everything! If it's 0%, then our model has no explanatory power.

8. Computing R-squared

Let's compute the R-squared statistic for both models you saw in the last video. In both cases, the outcome variable y is the observed log10_price. For Model 1, which used log10_size and year, the R-squared is 0.483, or 48.3%. So you can explain about half of the total variation in house prices using Model 1.
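One way to get this number in code is a least-squares fit followed by reading off R-squared. The sketch below uses Python's statsmodels with a tiny made-up data frame standing in for the course's house price data; the column names log10_price, log10_size, and year follow the video, but the values are invented, so only on the real data would the same call reproduce the 0.483 quoted above.

    import pandas as pd
    import statsmodels.formula.api as smf

    # A tiny made-up stand-in for the course's house price data
    house_prices = pd.DataFrame({
        "log10_price": [5.60, 5.75, 5.90, 6.10, 6.30, 6.05],
        "log10_size":  [3.10, 3.25, 3.30, 3.45, 3.60, 3.40],
        "year":        [1965, 1978, 1990, 2001, 2010, 1995],
    })

    # Model 1: outcome log10_price, explanatory variables log10_size and year
    model_1 = smf.ols("log10_price ~ log10_size + year", data=house_prices).fit()
    print(model_1.rsquared)  # proportion of the variation in log10_price explained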

9. Computing R-squared

For Model 3, which used condition instead of year, the R-squared is 0.462, or 46.2%. Now a lower proportion of the total variation in house prices is explained by Model 3. Since R-squared values closer to 1 mean better fit, the results suggest you choose Model 1, and thus that using size and year is preferred to using size and condition. This is the same conclusion you reached when you used the sum of squared residuals as the assessment criterion. Note, however, that sometimes no model yields an R-squared value close to 1. Sometimes the phenomenon you are modeling is so complex that no choice of variables will capture its behavior, and thus you only get low R-squared values.
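In code, the model comparison above might look like the following sketch: the same made-up stand-in data as before, plus an invented categorical condition column, with the two R-squared values compared directly.

    import pandas as pd
    import statsmodels.formula.api as smf

    # Made-up stand-in data, now with a hypothetical categorical `condition` column
    house_prices = pd.DataFrame({
        "log10_price": [5.60, 5.75, 5.90, 6.10, 6.30, 6.05],
        "log10_size":  [3.10, 3.25, 3.30, 3.45, 3.60, 3.40],
        "year":        [1965, 1978, 1990, 2001, 2010, 1995],
        "condition":   ["3", "3", "4", "4", "5", "3"],
    })

    model_1 = smf.ols("log10_price ~ log10_size + year", data=house_prices).fit()
    model_3 = smf.ols("log10_price ~ log10_size + condition", data=house_prices).fit()

    # Larger R-squared means better fit, so compare the two values directly
    print(model_1.rsquared, model_3.rsquared)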

10. Let's practice!

Let's now compute R-squared for the models you created in the exercises for Chapter 3!