Replacing missing credit data
Now check for missing data. If loan_status has missing values, those rows cannot be used to predict probability of default, because you would not know whether the loan defaulted. Missing data in person_emp_length is less damaging, but it would still cause errors during training.
So, check for missing data in the person_emp_length column and replace any missing values with the median.
The data set cr_loan has been loaded in the workspace.
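The check-and-impute step described above can be sketched on a small made-up frame. Here toy_loan, its values, and the other_feature column are hypothetical stand-ins for cr_loan, not the course data. The sketch assigns the filled column back instead of using inplace=True, which is one common way to write it and not necessarily the form the exercise expects.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for cr_loan (all values are made up).
toy_loan = pd.DataFrame({
    "person_emp_length": [1.0, np.nan, 5.0, 3.0, np.nan, 10.0],
    "other_feature": [11.5, 7.9, np.nan, 13.2, 9.1, 6.0],
    "loan_status": [0, 1, 0, 0, 1, 0],
})

# Columns that contain at least one missing value.
null_columns = toy_loan.columns[toy_loan.isnull().any()]
print(list(null_columns))

# Rows where employment length is missing.
missing_rows = toy_loan[toy_loan["person_emp_length"].isnull()]
print(missing_rows.head())

# Series.median() skips NaN by default, so it is computed from known values only.
emp_median = toy_loan["person_emp_length"].median()

# Assign the filled column back to the frame.
toy_loan["person_emp_length"] = toy_loan["person_emp_length"].fillna(emp_median)
print(toy_loan["person_emp_length"].isnull().sum())
```

The median is used here because a few very long employment lengths would pull a mean upward, while the median stays near the typical value.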
This exercise is part of the course
Credit Risk Modeling in Python
Exercise instructions
- Print an array of column names that contain missing data using .isnull().
- Print the top five rows of the data set that have missing data for person_emp_length.
- Replace the missing data with the median of all the employment lengths using .fillna().
- Create a histogram of the person_emp_length column to check the distribution.
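For the last step, it can help to see what a histogram with automatic binning actually computes. The sketch below assumes a made-up series of employment lengths (emp_length is hypothetical) and uses numpy.histogram with bins="auto" to print the bin edges and counts as text. To my understanding, matplotlib's histogram function accepts the same bins="auto" strategy, so the bars it draws should correspond to counts like these.

```python
import numpy as np
import pandas as pd

# Hypothetical employment lengths in years (made-up values, no missing data).
emp_length = pd.Series([0, 1, 2, 2, 3, 4, 4, 4, 5, 7, 9, 12], dtype=float)

# bins="auto" lets numpy choose the number of bins from the data.
counts, bin_edges = np.histogram(emp_length, bins="auto")

# Print each bin as a text row: left edge, right edge, number of values.
for left, right, count in zip(bin_edges[:-1], bin_edges[1:], counts):
    print(f"{left:5.1f} to {right:5.1f}: {count}")
```

A long, thin tail on the right of such a table would suggest a few unusually large values worth inspecting before modeling.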
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Print a null value column array
print(____.columns[____.____().any()])
# Print the top five rows with nulls for employment length
print(____[____[____].____()].head())
# Impute the null values with the median value for all employment lengths
____[____].____((cr_loan['person_emp_length'].____()), inplace=True)
# Create a histogram of employment length
n, bins, patches = plt.____(____[____], bins='auto', color='blue')
plt.xlabel("Person Employment Length")
plt.____()