
Validation set prediction framework

1. Validation set prediction framework

In this lesson I'll introduce the "validation set" prediction framework. This framework allows us to get a sense of how well a predictive model will perform, in our case, on new, previously unseen houses. This forms the backbone of a well-known machine learning method for model assessment called cross-validation.

2. Validation set approach

The underlying idea of the validation set approach is to first fit, or "train," a model on one set of data, but evaluate, or "validate," its performance on a different set of data. If you used the same data to both train your model and evaluate its performance, you could imagine your model easily being "overfit" to this data. In other words, you'd construct a model that's so overly specific to one dataset that it wouldn't generalize well to other datasets.

3. Training/test set split

Say your dataset has n observations. You randomly split the data into two sets: a training set in blue and a test set in orange. You'll use the blue observations to train, or fit, the model, then apply it to the orange observations to get predictions y-hat. For these same orange observations, you'll then assess the predictions y-hat by comparing them to the observed outcome variables y. By using independent training and test data in this way, you can get a sense of a model's predictive performance on "new" data.

4. Training/test set split in R

Let's do this with some nifty dplyr functions. You first use sample_frac() with size set to 1 and replace set to FALSE to randomly sample 100% of the rows of house_prices without replacement. This has the effect of randomly shuffling the order of the rows. You then set the training data to be the first 10,000 rows of house_prices_shuffled using slice(), and similarly set the test data to be the remaining 11,613 rows. Note that these two datasets have none of the original rows in common, and by randomly shuffling the rows before slicing, you've effectively randomly assigned the rows to train and test. Also, you're not limited to the rough 50/50 split I just used; I only did this for simplicity.
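As a concrete sketch, assuming the house_prices data frame from the moderndive package and the log10-transformed columns created earlier in the course (taking year from yr_built is an illustrative assumption), the split might look like this:

```r
library(dplyr)
library(moderndive)  # provides the house_prices data frame

# Assumed from earlier in the course: log10 transforms of price and size,
# plus a year column (taken here from yr_built)
house_prices <- house_prices %>%
  mutate(log10_price = log10(price),
         log10_size  = log10(sqft_living),
         year        = yr_built)

# Randomly shuffle the rows: sample 100% of them without replacement
house_prices_shuffled <- house_prices %>%
  sample_frac(size = 1, replace = FALSE)

# Training set: the first 10,000 shuffled rows
train <- house_prices_shuffled %>%
  slice(1:10000)

# Test set: the remaining 11,613 rows
test <- house_prices_shuffled %>%
  slice(10001:21613)
```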

5. Training models on training data

Let's fit the same regression model as earlier, using log10_size and year as predictor variables, but this time we set the data to be train and not house_prices. Let's then output the regression table. You again obtain values for the intercept and the slopes for log10_size and year in the estimate column, but these values differ slightly from before, when all of the house_prices data was used, because they are now based on only the randomly chosen subset of rows in the training data.
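A minimal sketch of this step, continuing from the split above (the object name model_price_1 is just an illustrative choice):

```r
# Fit the model on the training data only
model_price_1 <- lm(log10_price ~ log10_size + year, data = train)

# Output the regression table
get_regression_table(model_price_1)
```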

6. Making predictions on test data

Let's then apply this model to the test data to make predictions. In other words, take all 11,613 houses in test and compute the predicted values log10_price_hat. Recall from earlier that you can do this quickly by using the get_regression_points() function with the newdata argument set to test. You observe a log10_price_hat column of predicted values and the corresponding residuals. Note that since you have both the predicted values y-hat, in this case log10_price_hat, AND the observed values y, in this case log10_price, you can compute the residuals in the final column.
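In code, assuming the model_price_1 object from the previous step, this is a single call:

```r
# Apply the training-set model to the test set: the output includes a
# log10_price_hat column of predictions and a residual column
get_regression_points(model_price_1, newdata = test)
```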

7. Assessing predictions with RMSE

Let's now compute the root mean square error to assess our predictions as before: you first mutate() a new column of the squared residuals, then you summarize() these values with the square root of their mean, in this case in a single summarize() step. The RMSE is 0.165. Let's now repeat this for the model that used condition instead of year and compare the RMSEs.
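A sketch of that computation, again assuming the model_price_1 object from above:

```r
# RMSE: square the residuals, then take the square root of their mean
get_regression_points(model_price_1, newdata = test) %>%
  mutate(sq_residuals = residual^2) %>%
  summarize(rmse = sqrt(mean(sq_residuals)))
```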

8. Comparing RMSE

You again fit the model to the training data and then use the get_regression_points() function with the newdata argument again set to test to make predictions, and compute the RMSE. This RMSE of 0.168 is larger than the previous one of 0.165, suggesting that using condition instead of year yields worse predictions.
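A sketch of the comparison, using the illustrative object name model_price_2 for the competing model:

```r
# Competing model: condition instead of year, fit on the same training data
model_price_2 <- lm(log10_price ~ log10_size + condition, data = train)

# Predictions on the test set and the resulting RMSE
get_regression_points(model_price_2, newdata = test) %>%
  mutate(sq_residuals = residual^2) %>%
  summarize(rmse = sqrt(mean(sq_residuals)))
```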

9. Let's practice!

Now it's your turn. Let's use the validation set prediction framework to compare the predictive abilities of our other models for log10_price.
