Understanding distribution of categorical variables

We have looked at the distributions of ApplicantIncome and LoanIncome; now it is time to look at categorical variables in more detail. For instance, let's see whether Gender affects the loan status. We can check this with a cross-tabulation, as shown below:

pd.crosstab(train["Gender"], train["Loan_Status"], margins=True)
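
To make the shape of that table concrete, here is a minimal sketch on a made-up six-row frame. The `toy` DataFrame and its values are assumptions for illustration only; they are not the loan data.

```python
import pandas as pd

# Hypothetical toy data standing in for the real train DataFrame
toy = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Male", "Female", "Male"],
    "Loan_Status": ["Y", "N", "Y", "Y", "N", "Y"],
})

# Rows are Gender values, columns are Loan_Status values;
# margins=True appends an "All" row and an "All" column holding the totals
counts = pd.crosstab(toy["Gender"], toy["Loan_Status"], margins=True)
print(counts)
```

Each cell counts the rows that share that Gender and Loan_Status pair, so the "All" column is simply the row sum.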

Next, we can look at proportions, which can be more intuitive for drawing quick insights. We can compute them with the apply function. You can read more about the crosstab and apply functions here.


def percentageConvert(ser):
    # Divide every value in the row by the row total ("All"), which margins=True places last
    return ser / float(ser.iloc[-1])

pd.crosstab(train["Gender"], train["Loan_Status"], margins=True).apply(percentageConvert, axis=1)
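
As an alternative to a hand-written helper, crosstab can compute row proportions itself through its `normalize` argument. The sketch below uses a hypothetical `toy` frame (not the loan data) to show this.

```python
import pandas as pd

# Hypothetical toy data standing in for the real train DataFrame
toy = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Male", "Female", "Male"],
    "Loan_Status": ["Y", "N", "Y", "Y", "N", "Y"],
})

# normalize="index" divides each row by its own total,
# so every row of the result sums to 1
shares = pd.crosstab(toy["Gender"], toy["Loan_Status"], normalize="index")
print(shares)
```

The result has no "All" column, since each row is already expressed as a share of its total.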

Use value_counts() with train['Loan_Status'] to look at the frequency distribution.
Use crosstab with Loan_Status and Credit_History to perform bivariate analysis.
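
For reference, this is roughly how value_counts() behaves on a single column. The `status` Series below is made up for illustration and is not taken from the loan data.

```python
import pandas as pd

# Hypothetical column of loan outcomes
status = pd.Series(["Y", "N", "Y", "Y", "N", "Y"], name="Loan_Status")

# Absolute frequency of each category, most frequent first
freq = status.value_counts()

# normalize=True returns relative frequencies instead of counts
share = status.value_counts(normalize=True)

print(freq)
print(share)
```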
