1. Data splitting and confusion matrices
We have seen several techniques for preprocessing the data. When the data is fully preprocessed, you can go ahead and start your analysis.
2. Start analysis
You can run the model on the entire data set and evaluate the results on that same data, but this will most likely lead to a performance estimate that is too optimistic.
3. Training and test set
One alternative is to split the data into two pieces. The first part of the data, the so-called training set, can be used for building the model, and the second part of the data, the test set, can be used to test the results.
4. Training and test set
One common way of doing this is to use two-thirds of the data for a training set and one-third of the data for the test set.
Of course, there can be a lot of variation in the performance estimate, depending on which two-thirds of the data you select for the training set. One way to reduce this variation is by using cross-validation.
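The two-thirds versus one-third split can be sketched as follows. This is a minimal illustration in Python with NumPy; the 14-observation `loan_status` vector and the random seed are made up for demonstration.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Hypothetical data: 14 observations of loan_status (1 = default, 0 = non-default).
loan_status = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1])

# Shuffle the row indices, then take two-thirds for training and one-third for testing.
n = len(loan_status)
indices = rng.permutation(n)
n_train = int(round(2 / 3 * n))

train_idx = indices[:n_train]
test_idx = indices[n_train:]

train_set = loan_status[train_idx]
test_set = loan_status[test_idx]

print(len(train_set), len(test_set))  # 9 training and 5 test observations
```

Shuffling before splitting matters: if the data is ordered (for example, defaults grouped together), taking the first two-thirds directly would give an unrepresentative training set.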
5. Cross-validation
For the two-thirds training set and one-third test set example, a cross-validation variant would look like this. The data would be split in three equal parts, and each time, two of these parts would act as a training set, and one part would act as a test set. Of course, we could use as many parts as we want, but we would have to run the model many times if using many parts. This may become computationally heavy.
In this course, we will just use one training set and one test set containing two-thirds versus one-third of the data, respectively. Imagine we have just run a model, and now we apply the model to our test set to see how good the results are.
6. Evaluate a model
Evaluating the model for credit risk means comparing the observed outcomes of default versus non-default, stored in the loan_status variable of the test set, with the predicted outcomes according to the model. If we are dealing with a large number of predictions, a popular method for summarizing the results uses something called a confusion matrix. Here, we use just 14 values to demonstrate the concept.
7. Evaluate a model
A confusion matrix is a contingency table of correct and incorrect classifications. Correct classifications are on the diagonal of the confusion matrix. We see, for example, that eight non-defaulters were correctly classified as non-default, and three defaulters were correctly classified as defaulters. However, we see that two non-defaulters were wrongly classified as defaulters, and one defaulter was wrongly classified as a non-defaulter.
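Building this confusion matrix can be sketched as follows. This is a Python illustration using the counts from the slide (8, 3, 2, and 1); the exact ordering of the 14 observations is hypothetical, only the counts match the example.

```python
import numpy as np

# Observed and predicted outcomes for the 14 values (1 = default, 0 = non-default).
# 8 correct non-defaults, 2 non-defaulters predicted as default,
# 1 defaulter predicted as non-default, 3 correct defaults.
observed  = np.array([0] * 8 + [0] * 2 + [1] * 1 + [1] * 3)
predicted = np.array([0] * 8 + [1] * 2 + [0] * 1 + [1] * 3)

# 2x2 contingency table: rows = observed outcome, columns = predicted outcome.
conf_matrix = np.zeros((2, 2), dtype=int)
for obs, pred in zip(observed, predicted):
    conf_matrix[obs, pred] += 1

print(conf_matrix)
# [[8 2]
#  [1 3]]
```

Row 0 holds the observed non-defaulters (8 classified correctly, 2 wrongly as default) and row 1 the observed defaulters (1 wrongly as non-default, 3 classified correctly).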
8. Evaluate a model
The items on the diagonal are also called the true positives and true negatives. The off-diagonal items are called the false positives and the false negatives.
9. Some measures...
Several measures can be derived from the confusion matrix. We will discuss the classification accuracy, the sensitivity, and the specificity. The classification accuracy is the percentage of correctly classified instances, which is equal to 78.57% in this example. The sensitivity is the percentage of bad customers that are classified correctly, or 75% in this example. The specificity is the percentage of good customers that are classified correctly, or 80% in this example.
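These three measures can be computed directly from the four cells of the confusion matrix. A short Python sketch using the counts from the example (default is treated as the positive class):

```python
# Cell counts from the 14-value confusion matrix in the example.
tn, fp, fn, tp = 8, 2, 1, 3

accuracy = (tp + tn) / (tp + tn + fp + fn)  # fraction of correctly classified instances
sensitivity = tp / (tp + fn)                # fraction of bad customers classified correctly
specificity = tn / (tn + fp)                # fraction of good customers classified correctly

print(round(accuracy * 100, 2), sensitivity * 100, specificity * 100)
# → 78.57 75.0 80.0
```

Note that accuracy alone can be misleading for credit risk, because defaults are rare: reporting sensitivity and specificity separately shows how the model performs on each class.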
10. Let's practice!
Let's practice splitting the data and constructing confusion matrices.