Exploring the credit data

Throughout the exercises in this course, we will examine the loan_data dataset discussed in the video.

After being given loan_data, you are particularly interested in the defaulted loans in the dataset. You want to get an idea of the number and percentage of defaults. Defaults are rare, so you always want to check the proportion of defaults in a loan dataset. The CrossTable() function from the gmodels package is very useful here.

Remember that default information is stored in the response variable loan_status, where 1 represents a default and 0 represents a non-default.
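
As a quick check of this 0/1 coding, the number and share of defaults can also be computed in base R. The sketch below is only an illustration: toy_loans is a made-up stand-in for loan_data, and its values are invented, not taken from the course data.

```r
# Hypothetical stand-in for loan_data; all values are invented for illustration.
toy_loans <- data.frame(
  loan_status = c(0, 0, 1, 0, 0, 0, 1, 0, 0, 0),
  grade       = factor(c("A", "A", "G", "B", "A", "B", "G", "G", "A", "B"))
)

# Counts of non-defaults (0) and defaults (1)
default_counts <- table(toy_loans$loan_status)

# The same table expressed as proportions of all loans
default_props <- prop.table(default_counts)

# Because loan_status is coded 0/1, its mean equals the default rate
default_rate <- mean(toy_loans$loan_status)

print(default_counts)
print(default_props)
print(default_rate)
```

The mean shortcut works only because defaults are coded as 1 and non-defaults as 0; with any other coding, the table-based proportions are the safer route.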

To learn more about variable structures and spot unexpected tendencies in the data, you should examine the relationship between loan_status and certain factor variables. For example, you would expect that the proportion of defaults in the group of customers with grade G (worst credit rating score) is substantially higher than the proportion of defaults in the grade A group (best credit rating score).

Conveniently, CrossTable() can also be applied to two categorical variables. Let's explore!

Get familiar with the dataset by looking at its structure with str().
Load the gmodels package using library(). It is already installed on DataCamp's servers.
Have a look at the CrossTable() of loan status, using just one argument: loan_data$loan_status.
Call CrossTable() with x argument loan_data$grade and y argument loan_data$loan_status. We only want row-wise proportions, so set prop.r to TRUE, but prop.c, prop.t and prop.chisq to FALSE. (These arguments default to TRUE, which would add column proportions, table proportions and chi-square contributions to each cell; we do not need these here.)
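
The steps above can be sketched as follows. This is a minimal illustration rather than the course solution: it assumes the gmodels package is installed, and it uses a small made-up data frame, toy_loans, in place of loan_data, which is not reproduced here. The final base R lines are an added cross-check of the row-wise proportions, not part of the exercise.

```r
library(gmodels)  # provides CrossTable(); assumes the package is installed

# Hypothetical stand-in for loan_data; all values are invented for illustration.
toy_loans <- data.frame(
  loan_status = c(0, 0, 1, 0, 0, 0, 1, 0, 0, 0),
  grade       = factor(c("A", "A", "G", "B", "A", "B", "G", "G", "A", "B"))
)

# Step 1: inspect the structure (column names, types, first values)
str(toy_loans)

# Step 2: one-way table of loan_status (counts and proportions)
CrossTable(toy_loans$loan_status)

# Step 3: grade by loan_status, keeping only the row-wise proportions
CrossTable(
  x = toy_loans$grade,
  y = toy_loans$loan_status,
  prop.r = TRUE,      # share of each status within a grade
  prop.c = FALSE,     # no column proportions
  prop.t = FALSE,     # no table proportions
  prop.chisq = FALSE  # no chi-square contributions
)

# Cross-check in base R: margin = 1 normalises each row (grade) to sum to 1
row_props <- prop.table(
  table(toy_loans$grade, toy_loans$loan_status),
  margin = 1
)
print(row_props)
```

With the real loan_data, the row for each grade shows what fraction of loans in that grade defaulted, which is the figure to compare between grade A and grade G.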
