1. Using the R Squared statistic
So far in the course, we've talked about how to estimate and modify a model. In the final two chapters, we'll be looking at how the models we estimate can be evaluated and used to make predictions. In this chapter, we'll focus on the first part: evaluating models. This is a critically important part of model building because if our model doesn't fit the data well, then we won't be able to make very good predictions. We'll start by using one of the most common measures of model fit in linear regression: the R squared statistic.
2. What is R squared?
The R squared statistic is a measure of how well the independent variables in the model are able to predict the dependent variable. Specifically, the R squared statistic measures the proportion of variance in the dependent variable that can be explained by the independent variables. As with all proportions, the R squared ranges from 0 to 1, with 0 representing no variance explained, and 1 representing all the variance explained, or a deterministic model. Because of this, the R squared is also known as the coefficient of determination.
The R squared is calculated as 1 minus the sum of squared residuals (called the residual sum of squares) divided by the sum of squared deviations of the data from the mean (called the total sum of squares). In compact form: R squared = 1 - RSS / TSS.
3. What is R squared?
In other words, we take the observed value for an observation, subtract the predicted value, square the difference, and then sum that over all observations. That sum is the numerator of the ratio.
4. What is R squared?
In the denominator, we take the observed value for an observation, subtract the mean of the observed values, square that, and sum over all observations. This ratio of residual variance to total variance is what drives the R squared.
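As a sketch of the definition above, we can compute the R squared directly from the two sums of squares on simulated data (the variable names and toy data here are illustrative, not from the course):

```r
# Simulate a simple linear relationship with noise
set.seed(42)
x <- 1:20
y <- 2 * x + rnorm(20, sd = 4)

model <- lm(y ~ x)

# Numerator: residual sum of squares (observed minus predicted, squared, summed)
rss <- sum((y - predict(model))^2)

# Denominator: total sum of squares (observed minus mean, squared, summed)
tss <- sum((y - mean(y))^2)

r2 <- 1 - rss / tss
r2
```

This by-definition value matches the `r.squared` that `summary()` reports for the same model.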
5. Calculating R squared statistic
In a frequentist regression, we can get the R squared by estimating a model with the lm() function and saving the summary of the model object. That summary contains an `r.squared` element that we can pull out to view.
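A minimal sketch of that workflow, using the built-in mtcars data as a stand-in for the course's Spotify data:

```r
# Fit a frequentist linear regression and save the summary object
lm_model <- lm(mpg ~ wt, data = mtcars)
lm_summary <- summary(lm_model)

# The summary stores the R squared as an element we can extract
lm_summary$r.squared
```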
However, we can also calculate this by hand. We can define the residual sum of squares as the variance of the residuals of lm_model, and the total sum of squares as the sum of the variance of the residuals of lm_model and the variance of the predicted values of lm_model. Using variances in place of raw sums of squares works because both are divided by the same constant, which cancels in the ratio. Taking 1 minus the residual sum of squares divided by the total sum of squares gives us the same value that was saved in the lm_summary.
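The by-hand calculation can be sketched like this (again using mtcars as stand-in data):

```r
# Fit the same frequentist model
lm_model <- lm(mpg ~ wt, data = mtcars)

# Residual "sum of squares" as the variance of the residuals
ss_res <- var(residuals(lm_model))

# Total "sum of squares" as residual variance plus variance of the fitted values
ss_total <- ss_res + var(fitted(lm_model))

# 1 minus the ratio reproduces the summary's r.squared
r2_by_hand <- 1 - ss_res / ss_total
r2_by_hand
```

Because the residuals and fitted values of a least-squares fit are uncorrelated, their variances add up to the variance of the observed outcome, so this ratio equals the textbook RSS/TSS version.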
6. The R squared statistic of a Bayesian Model
In rstanarm, the R squared is not saved in the summary object as in the lm() summary function. However, we can still calculate the R squared by hand using the exact same formulas as we did for the frequentist regression. We define the residual sum of squares and the total sum of squares, and then use those to calculate the R squared value. Comparing this value to the value we got in the frequentist regression, we see that they are almost identical.
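A sketch of the same calculation for a Bayesian model, assuming rstanarm is installed and again using mtcars as stand-in data (the formula and data are illustrative; fitting runs MCMC, so it takes a moment):

```r
library(rstanarm)

# Fit a Bayesian regression with stan_glm()
stan_model <- stan_glm(mpg ~ wt, data = mtcars)

# Same formulas as the frequentist version: residuals() and fitted()
# for a stanreg object are based on the posterior mean prediction
ss_res <- var(residuals(stan_model))
ss_total <- ss_res + var(fitted(stan_model))

1 - ss_res / ss_total
```

The result is typically very close to, but not exactly equal to, the frequentist value, since the posterior mean coefficients are not identical to the least-squares estimates.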
7. Let's practice!
Now it's your turn to calculate the R squared for our model using Spotify data.