Understanding distribution of categorical variables
We have looked at the distributions of ApplicantIncome and LoanIncome; now it is time to look at the categorical variables in more detail. For instance, let's see whether Gender affects the loan status. We can check this with a cross-tabulation, as shown below:
pd.crosstab(train['Gender'], train['Loan_Status'], margins=True)
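To see what this call returns without the course data, here is a minimal sketch on a made-up dataframe. The name toy and its six rows are hypothetical stand-ins for train; only pd.crosstab and its margins argument come from the text above.

```python
import pandas as pd

# Hypothetical toy data standing in for the train dataframe
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Male", "Female", "Male"],
    "Loan_Status": ["Y", "N", "Y", "N", "Y", "Y"],
})

# Rows are Gender values, columns are Loan_Status values;
# margins=True appends an "All" row and an "All" column holding the totals
counts = pd.crosstab(toy["Gender"], toy["Loan_Status"], margins=True)
print(counts)
```

The "All" column is the row total, which is what the proportion calculation later divides by.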
Next, we can look at proportions, which are often more intuitive for drawing quick insights. We can do this using the apply function. You can read more about cross tab and apply functions here.
def percentageConvert(ser):
    # Divide every count in the row by the row total (the last entry, "All")
    return ser / float(ser.iloc[-1])
pd.crosstab(train["Gender"], train["Loan_Status"], margins=True).apply(percentageConvert, axis=1)
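As an alternative sketch (not the approach used in this exercise), pd.crosstab also accepts a normalize argument that computes the proportions directly. The dataframe toy below is hypothetical and only stands in for train.

```python
import pandas as pd

# Hypothetical toy data standing in for the train dataframe
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Male", "Female", "Male"],
    "Loan_Status": ["Y", "N", "Y", "N", "Y", "Y"],
})

# normalize="index" divides each row by its own total,
# so every row of the result sums to 1
row_share = pd.crosstab(toy["Gender"], toy["Loan_Status"], normalize="index")
print(row_share)
```

Here each row gives the share of rejected (N) and approved (Y) loans within one Gender group, which is the same quantity the apply-based version computes, without the extra "All" column.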
This exercise is part of the course
Introduction to Python & Machine Learning (with Analytics Vidhya Hackathons)
Exercise instructions
- Use value_counts() with train['Loan_Status'] to look at the frequency distribution
- Use crosstab with Loan_Status and Credit_History to perform bivariate analysis
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Training and testing datasets are loaded in the train and test dataframes respectively
# Approved Loan in absolute numbers
loan_approval = train['Loan_Status'].________()['Y']
# Two-way comparison: Credit History and Loan Status
twowaytable = pd.________(train["Credit_History"], train["Loan_Status"], margins=True)