Outliers
Now it's time to look at the structure of the variable age. A histogram is plotted on your right. Similar to what you observed in the video for annual income (annual_inc), there is a lot of blank space on the right-hand side of the plot. This is an indication of possible outliers. You will look at a scatterplot to verify this. If you find any outliers, you will delete them.
If outliers are observed for several variables, it might be useful to look at bivariate plots. It's possible the outliers belong to the same observation. If so, there is even more reason to delete the observation because it is more likely that some information stored in it is wrong.
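One way to check whether outliers in two variables belong to the same observation is to compare the row indices that each cutoff flags. The sketch below is only an illustration: toy_loans is a small made-up data frame (not the course's loan_data), index_highinc and index_both are hypothetical names, and the income cutoff of 1,000,000 is an assumption chosen for this toy data.

```r
# Toy data for illustration only; these values are made up
toy_loans <- data.frame(
  age        = c(25, 31, 47, 144, 38, 52),
  annual_inc = c(42000, 58000, 75000, 6000000, 61000, 39000)
)

# Row indices flagged by each cutoff (the income cutoff is an assumption)
index_highage <- which(toy_loans$age > 122)
index_highinc <- which(toy_loans$annual_inc > 1000000)

# Rows flagged as outliers on both variables
index_both <- intersect(index_highage, index_highinc)
index_both

# Bivariate scatterplot: a shared outlier sits far from the rest on both axes
plot(toy_loans$age, toy_loans$annual_inc, xlab = "Age", ylab = "Annual Income")
```

If index_both is non-empty, the same row is extreme on both variables, which supports the argument above for deleting that observation.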
This exercise is part of the course Credit Risk Modeling in R.
Exercise instructions
- Build a scatterplot of the variable age (through loan_data$age) using the function plot(). Give the y-axis the appropriate label "Age" using ylab as a second argument.
- The oldest person in this data set is older than 122 years! Get the index of this outlier using which() and the age of 122 as a cutoff (you can do this using loan_data$age > 122). Assign it to the object index_highage.
- Create a new data set new_data, after removing the observation with the high age using the object index_highage.
- Have a look at the bivariate scatterplot, with age on the x-axis and annual income on the y-axis. Change the labels to "Age" and "Annual Income", respectively.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Plot the age variable
# Save the outlier's index to index_highage
# Create data set new_data with outlier deleted
new_data <- loan_data[-___, ]
# Make bivariate scatterplot of age and annual income
plot(loan_data$age, loan_data$annual_inc, xlab = "___", ylab = "___")
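For reference, the four steps can be sketched end to end on a stand-in data set. This is a minimal sketch, not the course's official solution: toy_loans is a made-up data frame used in place of loan_data, and the guard against an empty index is an addition that the exercise does not ask for.

```r
# Toy stand-in for loan_data (made-up values, illustration only)
toy_loans <- data.frame(
  age        = c(23, 35, 41, 144, 29, 56),
  annual_inc = c(24000, 65000, 90000, 6000000, 48000, 72000)
)

# Scatterplot of age against row index, with a y-axis label
plot(toy_loans$age, ylab = "Age")

# Index of the observation whose age exceeds the cutoff of 122
index_highage <- which(toy_loans$age > 122)

# Drop that row; the guard matters because indexing with -integer(0)
# would return a data frame with zero rows instead of leaving it unchanged
new_data <- if (length(index_highage) > 0) {
  toy_loans[-index_highage, ]
} else {
  toy_loans
}

# Bivariate scatterplot of the cleaned data
plot(new_data$age, new_data$annual_inc, xlab = "Age", ylab = "Annual Income")
```

Plotting new_data rather than the original data shows the remaining observations without the extreme age stretching the x-axis. Plotting the original data instead keeps the outlier visible, which is useful for checking whether it is also extreme in annual income.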