Outliers

Now it's time to look at the structure of the variable age. A histogram is plotted on your right. Similar to what you observed in the video for annual income (annual_inc), there is a lot of blank space on the right-hand side of the plot, which indicates possible outliers. You will look at a scatterplot to verify this. If you find any outliers, you will delete them.

If outliers are observed for several variables, it might be useful to look at bivariate plots. It's possible the outliers belong to the same observation. If so, there is even more reason to delete the observation because it is more likely that some information stored in it is wrong.
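One way to check whether outliers in two variables belong to the same observation is to compare the row indices that each cutoff flags. The sketch below is only illustrative: the data frame toy_loans, its values, and the income cutoff are invented for this example and are not taken from loan_data.

```r
# Toy data standing in for loan_data (all values are made up for illustration)
toy_loans <- data.frame(
  age        = c(23, 35, 41, 29, 144, 52),
  annual_inc = c(42000, 58000, 61000, 39000, 6000000, 75000)
)

# Row indices flagged by each cutoff (the income cutoff is a hypothetical choice)
age_outliers    <- which(toy_loans$age > 122)
income_outliers <- which(toy_loans$annual_inc > 3000000)

# Rows that are flagged on both variables
shared_outliers <- intersect(age_outliers, income_outliers)
shared_outliers
```

If shared_outliers is not empty, the same rows are extreme on both variables, which strengthens the case that those observations contain faulty information.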

Build a scatterplot of the variable age (through loan_data$age) using the function plot(). Give the y-axis the appropriate label "Age" using ylab as a second argument.
The oldest person in this data set is older than 122 years! Get the index of this outlier using which() and the age of 122 as a cutoff (you can do this using loan_data$age > 122). Assign it to the object index_highage.
Create a new data set, new_data, by removing the observation with the high age using the object index_highage.
Have a look at the bivariate scatterplot, with age on the x-axis and annual income on the y-axis. Change the labels to "Age" and "Annual Income", respectively.
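The steps above can be sketched as follows. This is a minimal sketch, not the reference solution: the real loan_data is not available here, so the code builds a small made-up data frame with the same two column names, and it assumes the bivariate plot is drawn on the original data so the outlier stays visible.

```r
# Toy stand-in for loan_data (values are invented; only the column names match)
loan_data <- data.frame(
  age        = c(23, 35, 41, 29, 144, 52),
  annual_inc = c(42000, 58000, 61000, 39000, 6000000, 75000)
)

# Scatterplot of age against observation index, with a labelled y-axis
plot(loan_data$age, ylab = "Age")

# Index of the observation(s) with an age above the cutoff of 122
index_highage <- which(loan_data$age > 122)

# New data set without the flagged observation
new_data <- loan_data[-index_highage, ]

# Bivariate scatterplot: age on the x-axis, annual income on the y-axis
# (assumption: plotted on the original data to see whether the outliers coincide)
plot(loan_data$age, loan_data$annual_inc, xlab = "Age", ylab = "Annual Income")
```

Note that negative indexing with an empty index_highage returns a data frame with zero rows, so it is worth checking length(index_highage) before removing rows when a cutoff might flag nothing.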
