
Properly Training a Model

1. Properly Training a Model

In this lesson, you will learn best practices for training and evaluating a regression model.

2. Models can perform much better on training than they do on future data.

In general, a model performs better on its own training data than on data it hasn't yet seen. For simple models like linear regression, this optimistic bias is often not severe, but for more complex models, or even for a linear model with too many variables, evaluating the model only on its training data can produce misleading results. Here, we see a model that got an R-squared of 0.9 on its training data, but an R-squared of only 0.15 on new data. This model was overfit.

3. Test/Train Split

When you have a lot of data, the best thing to do is to split your data into two: one set to train the model, and another set to test it.

4. Example: Model Female Unemployment

For this example, we’ll use data from the World Bank about male and female unemployment rates in North America from 1991 to 2014. We’ll predict female unemployment rates from male unemployment rates. We have 96 rows of data, which we randomly split into a training set of 66 rows and a test set of 30 rows.
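As a sketch of that split in R (assuming the data are in a data frame called unemployment, a name used here only for illustration), one common approach is to draw a uniform random number per row:

```r
# Sketch of a random train/test split; `unemployment` is an assumed
# data frame of 96 rows with columns male_unemployment and female_unemployment.
set.seed(34245)                      # for reproducibility

gp <- runif(nrow(unemployment))      # one uniform draw per row
train <- unemployment[gp < 0.75, ]   # roughly 75% of rows (about 66) for training
test  <- unemployment[gp >= 0.75, ]  # the remainder (about 30 rows) for testing
```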

5. Model Performance: Train vs. Test

We can fit a linear model to the training data, and then calculate the RMSE and R-squared of the model on both the training and test sets. Here the model performs similarly on both sets, slightly better on training. Since the performance on the test set is not much worse, we know the model is not overfit.
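A minimal sketch of that comparison, using the train and test data frames from above and hypothetical rmse() and r_squared() helpers defined for illustration:

```r
# Fit the model on the training set only.
model <- lm(female_unemployment ~ male_unemployment, data = train)

# Simple helpers for RMSE and R-squared (defined here for illustration).
rmse <- function(predcol, ycol) sqrt(mean((ycol - predcol)^2))
r_squared <- function(predcol, ycol) {
  1 - sum((ycol - predcol)^2) / sum((ycol - mean(ycol))^2)
}

train$pred <- predict(model, newdata = train)
test$pred  <- predict(model, newdata = test)

rmse(train$pred, train$female_unemployment)       # training RMSE
rmse(test$pred,  test$female_unemployment)        # test RMSE
r_squared(train$pred, train$female_unemployment)  # training R-squared
r_squared(test$pred,  test$female_unemployment)   # test R-squared
```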

6. Cross-Validation

If you don’t have enough data to split into training and test sets, use cross-validation to estimate a model’s out-of-sample performance. In n-fold cross-validation, you partition the data into n subsets (in the figure we use n = 3). Let’s call them A, B, and C.

7. Cross-Validation

First, train a model using the data from sets A and B, and use that model to make predictions on C.

8. Cross-Validation

Then train a model on B and C to predict on A.

9. Cross-Validation

And a model on A and C to predict on B. Now, none of these models has made predictions on its own training data. All the predictions are essentially "test set" predictions. Therefore, RMSE and R-squared calculated from these predictions should give you an unbiased estimate of how a model fit to all the training data will perform on future data.

10. Create a cross-validation plan

We can implement a cross-validation plan using the function kWayCrossValidation from the package vtreat. The function needs the number of rows in the training data, and the number of folds to create. In our example, we use 3 folds.
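A minimal sketch of creating the plan, with the data frame name assumed as before:

```r
library(vtreat)

# 3-fold cross-validation plan for the rows of `unemployment`;
# the last two arguments are not used by this splitting function.
splitPlan <- kWayCrossValidation(nrow(unemployment), 3, NULL, NULL)
str(splitPlan[[1]])  # each fold is a list with $train and $app index vectors
```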

11. Create a cross-validation plan

The function returns the indices for training and testing for each fold. Use the data with the training indices to fit a model, and then make predictions on the data with the app, or test, indices.
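A sketch of using the plan, with the same assumed data frame and column names:

```r
k <- 3                      # number of folds
unemployment$pred_cv <- 0   # column to collect the cross-validated predictions

for (i in 1:k) {
  split <- splitPlan[[i]]
  # Fit on this fold's training rows ...
  model_i <- lm(female_unemployment ~ male_unemployment,
                data = unemployment[split$train, ])
  # ... and predict on the held-out (app) rows.
  unemployment$pred_cv[split$app] <- predict(model_i,
                                             newdata = unemployment[split$app, ])
}
```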

12. Final Model

If the estimated model performance looks good enough, then use all the data to fit a final model. You can’t evaluate this final model’s future performance, because you don’t have data to evaluate it with. Cross-validation only tests the modeling process, while test/train split tests the final model.
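With the same assumed names, that final fit is simply a model trained on all of the rows:

```r
# Fit the final model to all of the data.
final_model <- lm(female_unemployment ~ male_unemployment, data = unemployment)
```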

13. Example: Unemployment Model

We can use cross-validation to estimate the out-of-sample performance of a model to fit female unemployment. The estimates are similar to those made by using a test set.
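A sketch of scoring those cross-validated predictions, reusing the helper functions from the train/test example above:

```r
# Out-of-sample performance estimated from the cross-validated predictions.
rmse(unemployment$pred_cv, unemployment$female_unemployment)
r_squared(unemployment$pred_cv, unemployment$female_unemployment)
```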

14. Let's practice!

Now let's practice test/train splits and cross-validation.