
Understanding distribution of categorical variables

We have looked at the distributions of ApplicantIncome and LoanIncome; now it is time to look at categorical variables in more detail. For instance, let's see whether Gender affects the loan status. A cross-tabulation gives a first look at this, as shown below:

pd.crosstab(train["Gender"], train["Loan_Status"], margins=True)
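To see what the cross-tabulation returns, here is a minimal sketch on a small made-up table. The `toy` DataFrame and its six rows are assumptions for illustration only; they are not the course's `train` data.

```python
import pandas as pd

# Hypothetical toy data standing in for the course's `train` DataFrame
toy = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Male", "Female", "Male"],
    "Loan_Status": ["Y", "N", "Y", "Y", "N", "Y"],
})

# Rows are Gender values, columns are Loan_Status values.
# margins=True appends an "All" row and an "All" column holding the totals.
counts = pd.crosstab(toy["Gender"], toy["Loan_Status"], margins=True)
print(counts)
```

Each cell counts the rows that share that Gender and Loan_Status combination, so the "All" column gives the number of applicants per gender and the "All" row gives the number per loan status.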

Next, we can look at proportions, which are often more intuitive than raw counts for drawing quick insights. We can compute them using the apply function. You can read more about the crosstab and apply functions in the pandas documentation.


def percentageConvert(ser):
  # Divide every value in the row by the row total (the last entry, "All")
  return ser/float(ser.iloc[-1])

pd.crosstab(train["Gender"], train["Loan_Status"], margins=True).apply(percentageConvert, axis=1)
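The row-wise conversion can be sketched on made-up data as follows. The `toy` DataFrame and the helper name `percentage_convert` are assumptions for illustration. The helper reads the row total with `.iloc[-1]`, because plain integer indexing such as `ser[-1]` on a label-indexed Series is deprecated in recent pandas versions. The sketch also shows `normalize="index"`, a built-in `crosstab` option that yields row proportions directly.

```python
import pandas as pd

# Hypothetical toy data standing in for the course's `train` DataFrame
toy = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Male", "Female", "Male"],
    "Loan_Status": ["Y", "N", "Y", "Y", "N", "Y"],
})

def percentage_convert(ser):
    # The last entry of each row is the "All" total added by margins=True
    return ser / float(ser.iloc[-1])

counts = pd.crosstab(toy["Gender"], toy["Loan_Status"], margins=True)

# axis=1 passes each row to the function, so every row is divided by its own total
shares = counts.apply(percentage_convert, axis=1)
print(shares)

# Built-in alternative: normalize="index" divides each row by its row sum
shares_builtin = pd.crosstab(toy["Gender"], toy["Loan_Status"], normalize="index")
print(shares_builtin)
```

Both tables give the share of approved and rejected loans within each gender; the apply version additionally keeps an "All" column, which is always 1.0 after the division.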

This exercise is part of the course

Introduction to Python & Machine Learning (with Analytics Vidhya Hackathons)


Exercise instructions

  • Use value_counts() with train['Loan_Status'] to look at the frequency distribution
  • Use crosstab with Loan_Status and Credit_History to perform bivariate analysis
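As a rough illustration of these two steps, the sketch below runs them on a small made-up table. The `toy` DataFrame and its values are assumptions, not the course data; only the column names follow the exercise.

```python
import pandas as pd

# Hypothetical toy data; the column names follow the exercise, the values are invented
toy = pd.DataFrame({
    "Credit_History": [1.0, 1.0, 0.0, 1.0, 0.0, 0.0],
    "Loan_Status": ["Y", "Y", "N", "Y", "N", "Y"],
})

# Frequency distribution of a single categorical column
status_counts = toy["Loan_Status"].value_counts()
print(status_counts)

# value_counts() returns a Series indexed by category, so one count can be looked up by label
approved = status_counts["Y"]

# Bivariate view: Credit_History in rows, Loan_Status in columns, with totals
table = pd.crosstab(toy["Credit_History"], toy["Loan_Status"], margins=True)
print(table)
```

Here `approved` holds the number of rows with Loan_Status equal to "Y", and `table` shows how those approvals split across the two credit history values.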

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# The training and testing datasets are loaded in the train and test dataframes respectively

# Approved Loan in absolute numbers
loan_approval = train['Loan_Status'].________()['Y']

# Two-way comparison: Credit History and Loan Status
twowaytable = pd.________(train["Credit_History"], train["Loan_Status"], margins=True)

