
Model validation

1. Model validation

In marketing practice, the goal is to make predictions. One way to assess how accurately a predictive model will perform in practice is to evaluate its performance in predicting the outcome of interest on an independent dataset. This is the topic of this last lesson.

2. Subsetting

In a prediction setting, the data is usually partitioned into a training and a test set. The test set, also called the validation or holdout sample, is typically smaller than the training set. In our example, we withhold the last purchase recorded for each household: the training set contains all earlier purchases, while the test set consists of these withheld last purchases. We partition the original choice-dot-data set into subsamples using the function subset(). The subset function selects the rows of a data frame that satisfy a pre-specified logical condition. For the training set, we use all observations where the column LASTPURCHASE equals 0, and name this subset train-dot-data. Similarly, for the test set, we use all observations where LASTPURCHASE equals 1, and name this subset test-dot-data. We check the sizes of the test and training sets using the function dim(). Both samples have the same number of columns, but the test set consists of only 300 observations, which corresponds to approximately 10% of the full choice-dot-data set.
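The partitioning step can be sketched in R as follows. Since the course's choice-dot-data is not available here, a small simulated data frame with the column names mentioned in the transcript stands in for it; the response column name CHOICE is an assumption.

```r
set.seed(1)
n <- 3000

# Simulated stand-in for choice.data; in the course this data frame
# already exists. The column CHOICE as response name is an assumption.
choice.data <- data.frame(
  CHOICE       = rbinom(n, 1, 0.5),
  price.ratio  = runif(n, 0.8, 1.2),
  FEATURE      = rbinom(n, 1, 0.3),
  DISPLAY      = rbinom(n, 1, 0.2),
  LASTPURCHASE = rep(c(rep(0, 9), 1), n / 10)  # every 10th row is a last purchase
)

# Training set: all purchases except each household's last one
train.data <- subset(choice.data, LASTPURCHASE == 0)
# Test set: the withheld last purchases
test.data  <- subset(choice.data, LASTPURCHASE == 1)

dim(train.data)  # 2700 rows, 5 columns
dim(test.data)   # 300 rows, 5 columns
```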

3. Model training

In the validation process, the response model is estimated on the training dataset and tested on the test dataset. The goal is to test the model's ability to predict independent data that were not used for estimation, in order to gain insight into how the model will generalize to future data. In the next step, we estimate a logistic response model using the predictors that remained after model selection in the previous lesson. Based on the training dataset, we explain the purchases of HOPPINESS by changes in the price-dot-ratio, FEATURE activities, and the combination of FEATURE and DISPLAY activities using the function glm(). Again, we investigate the estimated coefficients in terms of their marginal effects. The coefficients are quite similar in size to those obtained from the extended-dot-model fitted on the full dataset.
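A sketch of the training step. The exact formula is an assumption based on the predictors named in the transcript, and the data are simulated because the course data is not available here; the response name CHOICE and the simulated coefficient values are illustrative only.

```r
set.seed(1)
n <- 3000
price.ratio <- runif(n, 0.8, 1.2)
FEATURE     <- rbinom(n, 1, 0.3)
DISPLAY     <- rbinom(n, 1, 0.2)
# Simulated purchase indicator (assumed data-generating values)
CHOICE <- rbinom(n, 1, plogis(4 - 4 * price.ratio + 1.5 * FEATURE + FEATURE * DISPLAY))
train.data <- data.frame(CHOICE, price.ratio, FEATURE, DISPLAY,
                         LASTPURCHASE = rep(c(rep(0, 9), 1), n / 10))
train.data <- subset(train.data, LASTPURCHASE == 0)

# Logistic response model with the predictors retained after model selection
logit.model <- glm(CHOICE ~ price.ratio + FEATURE + FEATURE:DISPLAY,
                   family = binomial(link = "logit"),
                   data   = train.data)
summary(logit.model)

# Average marginal effects: scale each coefficient by the mean logistic
# density evaluated at the fitted linear predictors
ame <- mean(dlogis(predict(logit.model))) * coef(logit.model)
```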

4. Out-of-sample testing

The trained model is then used to predict the outcomes in the initially defined test dataset. The goal is to estimate how well the model fits data that are independent of the data used for estimation. This involves using the fitted coefficients of the trained model to predict the values in the test dataset. We can do this with the function predict(), which produces the predicted values for the test dataset by evaluating the fitted coefficients in the trained model object. To obtain predictions on the scale of the response variable, which are the predicted purchase probabilities, we set the additional type argument to "response". Afterward, the predictions are compared to the true values of each observation. Again, we build a binary classifier by classifying all predicted purchase probabilities greater than the cutoff value 0-point-5 as 1, and 0 otherwise, using the function ifelse(). To evaluate the performance of the model on the out-of-sample observations, we cross-tabulate the observed purchases of the test dataset, stored in observed, against the classified purchases, stored in predicted, using the function table(). Our model correctly classifies more than 92 percent of the cases, which is quite good.
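The whole out-of-sample test can be sketched end to end as below. The data are again simulated stand-ins (CHOICE as response name and the simulation parameters are assumptions), so the resulting hit rate will not reproduce the 92 percent reported for the course data.

```r
set.seed(1)
n <- 3000
price.ratio <- runif(n, 0.8, 1.2)
FEATURE     <- rbinom(n, 1, 0.3)
DISPLAY     <- rbinom(n, 1, 0.2)
# Simulated purchase indicator (assumed data-generating values)
CHOICE <- rbinom(n, 1, plogis(4 - 4 * price.ratio + 1.5 * FEATURE + FEATURE * DISPLAY))
choice.data <- data.frame(CHOICE, price.ratio, FEATURE, DISPLAY,
                          LASTPURCHASE = rep(c(rep(0, 9), 1), n / 10))
train.data <- subset(choice.data, LASTPURCHASE == 0)
test.data  <- subset(choice.data, LASTPURCHASE == 1)

logit.model <- glm(CHOICE ~ price.ratio + FEATURE + FEATURE:DISPLAY,
                   family = binomial, data = train.data)

# Predicted purchase probabilities for the held-out last purchases
pred.prob <- predict(logit.model, newdata = test.data, type = "response")

# Binary classifier with a 0.5 cutoff
predicted <- ifelse(pred.prob > 0.5, 1, 0)
observed  <- test.data$CHOICE

# Cross-tabulate observed vs. classified purchases
table(observed, predicted)

# Share of correctly classified cases (hit rate)
hit.rate <- mean(observed == predicted)
```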

5. Let's practice!

Time for the last exercise! I hope you enjoyed the course!