1. Model validation, model fit, and prediction
There are several goodness-of-fit measures for judging how well a model fits the data. One is the so-called coefficient of determination, or the `Multiple R-squared`.
2. Coefficient of Determination $R^2$
The value of `Multiple R-squared` gives the proportion of the dependent variable's variance that is explained by the regression model; the related `Adjusted R-squared` additionally corrects this value for the number of variables in the model. Hence, if $R^2$ equals 0, none of the variation is explained. An $R^2$ equal to 1 corresponds to a model that explains 100% of the dependent variable's variation. In general I want my $R^2$ to be as high as possible, but values above 0.9 are rarely reached.
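As a minimal sketch, this is how both values can be read off a fitted `lm` object in `R`; the simulated data and object names are illustrative, not from the course dataset:

```r
# Simulate a small dataset and fit a linear model (illustrative names).
set.seed(1)
x1 <- rnorm(100)
x2 <- rnorm(100)
y  <- 2 + 1.5 * x1 - 0.5 * x2 + rnorm(100)

fit <- lm(y ~ x1 + x2)

summary(fit)$r.squared      # Multiple R-squared
summary(fit)$adj.r.squared  # Adjusted R-squared, corrected for model size
```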
3. $R^2$ and F-test
The F-test is a test of the overall fit of the model. It tests the null hypothesis that $R^2$ is equal to 0, that is, that no regressor has any explanatory power. In our model, the `p-value` of the F-test is smaller than 0.05; hence, the hypothesis of an $R^2$ of zero is rejected. At least one regressor (or a set of regressors) has significant explanatory power, and the variables included in the model explain some variation of the margin in year 2.
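A sketch of how the F-statistic can be pulled out of the model summary and its p-value computed by hand, using the same kind of simulated data as above:

```r
# Fit an illustrative model on simulated data.
set.seed(1)
x1 <- rnorm(100)
x2 <- rnorm(100)
y  <- 2 + 1.5 * x1 - 0.5 * x2 + rnorm(100)
fit <- lm(y ~ x1 + x2)

# The summary stores the F-statistic with its degrees of freedom.
fstat <- summary(fit)$fstatistic  # named vector: value, numdf, dendf
pf(fstat["value"], fstat["numdf"], fstat["dendf"], lower.tail = FALSE)
```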
4. Overfitting
So far I have considered only in-sample goodness-of-fit measures, that is to say, the model is evaluated on the same data that it was fitted on. This bears the risk of overfitting. Overfitting occurs when not only the systematic relation between the variables - shown in blue - is modeled, but also the random noise in the data, as the curve shown in red does. Such a model performs great when predicting on the dataset it has been fitted on, but its prediction results on new data are poor.
The linear model, at first glance, looks like it does not fit well, but for predictions it will be superior to the more complicated model shown in red.
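The effect can be reproduced with a small simulation; the data-generating process and the degree of the polynomial below are purely illustrative:

```r
# True relation is linear; the high-degree polynomial also fits the noise.
set.seed(2)
train <- data.frame(x = runif(30, 0, 10))
train$y <- 2 + 0.8 * train$x + rnorm(30)
test <- data.frame(x = runif(30, 0, 10))
test$y <- 2 + 0.8 * test$x + rnorm(30)

linFit  <- lm(y ~ x, data = train)            # lean linear model
polyFit <- lm(y ~ poly(x, 15), data = train)  # flexible, overfits

# In-sample, the polynomial wins ...
mean((train$y - fitted(linFit))^2)
mean((train$y - fitted(polyFit))^2)

# ... but out-of-sample, the lean linear model predicts better.
mean((test$y - predict(linFit, newdata = test))^2)
mean((test$y - predict(polyFit, newdata = test))^2)
```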
5. Methods to avoid overfitting
There are several ways to avoid overfitting. One is to keep your model lean. Some goodness-of-fit measures (for example, the AIC) penalize every additional explanatory variable, so that you can control for overfitting while developing a model. When comparing two models, the AIC-minimizing model is preferred.
In `R` you can find the AIC value using the function `AIC()` from the `stats` package. Note that here, since I am not comparing models to each other, I cannot draw any conclusions from an `AIC` of 33950.45.
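A sketch of such a comparison on simulated data; only the difference between the two AIC values carries information:

```r
# x2 is pure noise, so the leaner model should win the AIC comparison.
set.seed(3)
d <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
d$y <- 1 + 2 * d$x1 + rnorm(100)

fit1 <- lm(y ~ x1, data = d)
fit2 <- lm(y ~ x1 + x2, data = d)

AIC(fit1, fit2)  # the model with the smaller AIC is preferred
```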
Automatic model selection can be done using `stepAIC()` from the `MASS` package. More on that in the next chapter on logistic regression.
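As a rough preview, a call to `stepAIC()` could look like this; the formula and data are again illustrative:

```r
library(MASS)

# Simulated data where x2 and x3 carry no information about y.
set.seed(4)
d <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
d$y <- 1 + 2 * d$x1 + rnorm(100)

fullFit <- lm(y ~ x1 + x2 + x3, data = d)
stepFit <- stepAIC(fullFit, direction = "both", trace = FALSE)
summary(stepFit)  # the noise variables are typically dropped
```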
Other methods to avoid overfitting, like out-of-sample model validation or cross-validation, are also explained in depth in the chapter on logistic regression.
6. New dataset clvData2
Let's turn to prediction. So far I have used explanatory variables from year one in order to explain variation in the margin of year two. Now, I will use explanatory variables from year two in order to predict the margin of year three. To do this, I make use of the new dataset `clvData2`.
7. Prediction
Now, prediction is fairly easy. I just hand the model `multipleLM2` and the new dataset `clvData2` over to the `predict()` function. If I store the predictions in a vector, such as `predMargin`, I can use them for further analysis, for example to calculate their mean.
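As a sketch of that step, assuming `multipleLM2` has already been fitted on the year-two data in an earlier step:

```r
# Predict the margin of year 3 from the year-2 explanatory variables.
predMargin <- predict(multipleLM2, newdata = clvData2)

# The predictions can be used for further analysis, e.g. their mean.
mean(predMargin, na.rm = TRUE)
```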
8. Learnings linear regression
Perfect, you made it through the whole chapter! Check out what you have learned.
9. Learnings from the model
10. Alright, hands on!
And don't forget to practice.