Exploring the credit data
We will be examining the dataset loan_data discussed in the video throughout the exercises in this course.
After being given loan_data, you are particularly interested in the defaulted loans in the data set. You want to get an idea of the number and percentage of defaults. Defaults are rare, so you always want to check what the proportion of defaults is in a loan dataset. The CrossTable() function is very useful here.
Remember that default information is stored in the response variable loan_status, where 1 represents a default and 0 represents non-default.
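To make "number and percentage of defaults" concrete, here is a minimal base R sketch. It assumes a small made-up loan_status vector (the real loan_data is not reproduced here), and the names default_counts and default_props are hypothetical helpers chosen for illustration.

```r
# Hypothetical toy stand-in for loan_data$loan_status (values are made up)
loan_status <- c(0, 0, 0, 1, 0, 0, 0, 0, 1, 0)

# Absolute number of non-defaults (0) and defaults (1)
default_counts <- table(loan_status)

# Share of each class; the entry named "1" is the default rate
default_props <- prop.table(default_counts)

print(default_counts)
print(default_props)
```

A one-argument CrossTable() call reports this same kind of information (counts plus proportions) in a single formatted table.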
To learn more about variable structures and spot unexpected tendencies in the data, you should examine the relationship between loan_status and certain factor variables. For example, you would expect the proportion of defaults among customers with grade G (worst credit rating score) to be substantially higher than among customers with grade A (best credit rating score).
Conveniently, CrossTable() can also be applied to two categorical variables. Let's explore!
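The comparison between grades boils down to a row-wise proportion: within each grade, what share of loans defaulted? The sketch below illustrates the idea in base R on a made-up data frame called toy (hypothetical, with only grades A and G); it is not the course's data and not the CrossTable() output itself.

```r
# Hypothetical toy data; grades and outcomes are made up for illustration
toy <- data.frame(
  grade = factor(c("A", "A", "A", "A", "G", "G", "G", "G")),
  loan_status = c(0, 0, 0, 1, 1, 1, 0, 1)
)

# Counts of each loan_status value within each grade
counts <- table(toy$grade, toy$loan_status)

# margin = 1 normalises each row, giving the default share per grade
row_props <- prop.table(counts, margin = 1)
print(row_props)
```

In this toy example the column labelled 1 holds the default share per grade, so the A and G rows can be compared directly.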
This exercise is part of the course Credit Risk Modeling in R.
Exercise instructions
- Get familiar with the dataset by looking at its structure with str().
- Load the gmodels package using library(). It is already installed on DataCamp's servers.
- Have a look at the CrossTable() of loan status, using just one argument: loan_data$loan_status.
- Call CrossTable() with x argument loan_data$grade and y argument loan_data$loan_status. We only want row-wise proportions, so set prop.r to TRUE, but prop.c, prop.t and prop.chisq to FALSE (the default values here are TRUE, which would add column proportions, table proportions and chi-square contributions to each cell; we do not need these here).
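As a rough illustration of how these steps might fit together outside the course environment, the sketch below runs them on a small made-up data frame. The loan_data defined here is a hypothetical stand-in with only two columns, status_by_grade is a name chosen for illustration, and the gmodels package is assumed to be installed locally.

```r
# Hypothetical stand-in for loan_data; the course provides the real data frame
loan_data <- data.frame(
  loan_status = c(0, 0, 0, 1, 1, 1, 0, 1),
  grade = factor(c("A", "A", "A", "A", "G", "G", "G", "G"))
)

# Inspect column types and the first few values
str(loan_data)

# gmodels supplies CrossTable(); assumed to be installed locally
library(gmodels)

# One-way table: counts and proportions of each loan_status value
CrossTable(loan_data$loan_status)

# Two-way table restricted to row-wise proportions
status_by_grade <- CrossTable(
  x = loan_data$grade,
  y = loan_data$loan_status,
  prop.r = TRUE,
  prop.c = FALSE,
  prop.t = FALSE,
  prop.chisq = FALSE
)
```

Switching off prop.c, prop.t and prop.chisq leaves only the count and the row proportion in each cell, which makes the per-grade default share easier to read.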
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# View the structure of loan_data
# Load the gmodels package
# Call CrossTable() on loan_status
# Call CrossTable() on grade and loan_status