Outliers

Now it's time to look at the structure of the variable age. A histogram of age is plotted on your right. As with annual income (annual_inc) in the video, there is a lot of blank space on the right-hand side of the plot, which indicates possible outliers. You will look at a scatterplot to verify this and, if you find any outliers, delete them.

If outliers are observed for several variables, it might be useful to look at bivariate plots. It's possible the outliers belong to the same observation. If so, there is even more reason to delete the observation because it is more likely that some information stored in it is wrong.
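One way to check whether outliers in two variables belong to the same observation, besides reading it off a bivariate plot, is to compare the row indices flagged for each variable. The following is a minimal sketch on an invented toy data frame, not the course's loan_data; the name toy, the values in it, and the income cutoff of 1,000,000 are all assumptions made for illustration.

```r
# Toy stand-in for loan_data (values are invented for illustration)
toy <- data.frame(
  age        = c(23, 35, 41, 144, 29),
  annual_inc = c(42000, 58000, 61000, 6000000, 39000)
)

# Row indices flagged as outliers in each variable (cutoffs are assumptions)
out_age <- which(toy$age > 122)
out_inc <- which(toy$annual_inc > 1000000)

# Rows flagged in both variables
both <- intersect(out_age, out_inc)
both
```

If both is non-empty, the extreme values sit in the same row or rows, which supports deleting those observations as a whole.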

This exercise is part of the course

Credit Risk Modeling in R

Exercise instructions

  • Build a scatterplot of the variable age (through loan_data$age) using the function plot(). Give the y-axis the appropriate label "Age" using ylab as a second argument.
  • The oldest person in this data set is older than 122 years! Get the index of this outlier using which() and the age of 122 as a cutoff (you can do this using loan_data$age > 122). Assign it to the object index_highage.
  • Create a new data set new_data, after removing the observation with the high age using the object index_highage.
  • Have a look at the bivariate scatterplot, with age on the x-axis and annual income on the y-axis. Change the labels to "Age" and "Annual Income", respectively.
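The indexing steps in the instructions above (locating the outlier with which() and dropping it with a negative row index) can be sketched on toy data. This is an illustration under stated assumptions rather than the exercise solution: the data frame toy, its values, and the name toy_clean are invented here. The length check is an extra safeguard that the exercise does not ask for.

```r
# Toy stand-in for loan_data (values are invented for illustration)
toy <- data.frame(
  age        = c(23, 35, 41, 144, 29),
  annual_inc = c(42000, 58000, 61000, 6000000, 39000)
)

# which() converts the logical vector into the positions that are TRUE
index_highage <- which(toy$age > 122)

# A negative index drops those rows. Guard against an empty index:
# toy[-integer(0), ] would select zero rows rather than keep all of them.
toy_clean <- if (length(index_highage) > 0) toy[-index_highage, ] else toy

nrow(toy)            # 5 rows before removal
nrow(toy_clean)      # 4 rows after removal
max(toy_clean$age)   # 41, the largest remaining age
```

The guard matters only when no observation exceeds the cutoff. In the exercise, the instructions state that one such observation exists, so the plain negative index is sufficient there.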

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Plot the age variable


# Save the outlier's index to index_highage


# Create data set new_data with outlier deleted
new_data <- loan_data[-___, ]

# Make bivariate scatterplot of age and annual income
plot(loan_data$age, loan_data$annual_inc, xlab = "___", ylab = "___")