
Histograms and outliers

1. Histograms and outliers

Now let's look at some continuous variables using histograms and plots.

2. Using function hist()

For a basic histogram, you call the function hist() with the variable of interest as its argument, in this case the interest rate.

3. Using function hist()

You can use the arguments main and xlab for a clearer title and x-axis label. The y-axis shows the frequency of each interval of the variable of interest. Here, you can see that all loans had an interest rate above 5%, and very few loans had an interest rate above 20%.
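As a minimal sketch of the call described above: the data frame loan_data and the column name int_rate are assumptions, and the values are simulated stand-ins rather than the real loan data.

```r
# Simulated interest rates (hypothetical data, not the course data set).
set.seed(1)
loan_data <- data.frame(int_rate = runif(1000, min = 5.5, max = 22))

# Histogram with a custom title (main) and x-axis label (xlab).
hist_int <- hist(loan_data$int_rate,
                 main = "Histogram of interest rate",
                 xlab = "Interest rate")

# The bar heights drawn on the y-axis are stored as counts.
hist_int$counts
```

Assigning the result of hist() to a name is optional; it is done here only so the frequencies can be inspected afterwards.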

4. Using function hist() on annual_inc

Let's have a look at the histogram of annual income. We notice that we get a strange result here, with seemingly just one big bar.

5. Using function hist() on annual_inc

Storing the histogram in hist_income and inspecting hist_income$breaks gives the locations of the histogram breaks.

6. The breaks-argument

To get a clearer picture of the data structure, you can change the number of breaks using the breaks argument, so that you get a more intuitive plot. You can either choose a number that seems more appropriate or use a rule of thumb, such as the square root of the number of observations in the data set. This results in a much longer vector of breaks. However, the result still doesn't look very nice here, with a lot of blank space, because the x-axis of the histogram automatically ranges from the smallest observed value to the largest one.
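The two steps can be sketched as follows. The incomes are simulated (999 ordinary values plus one extreme value), and the names loan_data, n_breaks, and hist_income_sqrt are assumptions made for this illustration. Note that a single number passed to breaks is treated by hist() as a suggestion, so the actual number of bars may differ slightly.

```r
# Simulated incomes with one extreme value (hypothetical data).
set.seed(42)
loan_data <- data.frame(
  annual_inc = c(rlnorm(999, meanlog = 11, sdlog = 0.5), 6e6)
)

# Default histogram: store it and look at where the breaks fall.
hist_income <- hist(loan_data$annual_inc,
                    main = "Histogram of annual income",
                    xlab = "Annual income")
hist_income$breaks

# Rule of thumb: about sqrt(number of observations) breaks.
n_breaks <- sqrt(nrow(loan_data))
hist_income_sqrt <- hist(loan_data$annual_inc,
                         breaks = n_breaks,
                         main = "Histogram of annual income",
                         xlab = "Annual income")

# The vector of breaks is now much longer than before.
length(hist_income_sqrt$breaks)
```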

7. annual_inc

Let us look at a scatterplot to see what is going on. In this plot, the annual income is shown on the y-axis, and the observation's index number is shown on the x-axis.

8. annual_inc

We see that there is one huge salary of 6 million dollars, whereas none of the others exceeds roughly 2 million dollars. We consider this an outlier.
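A sketch of such an index plot, again on simulated incomes (the vector annual_inc and the name index_highest are assumptions for this example). When plot() receives a single numeric vector, it puts the observation index on the x-axis and the values on the y-axis.

```r
# Simulated incomes with one extreme value appended last (hypothetical data).
set.seed(42)
annual_inc <- c(rlnorm(999, meanlog = 11, sdlog = 0.5), 6e6)

# One argument only: index on the x-axis, annual income on the y-axis.
plot(annual_inc, ylab = "Annual income")

# Locate the observation that stands out in the plot.
index_highest <- which.max(annual_inc)
annual_inc[index_highest]
```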

9. Outliers

In statistics, an outlier is an observation that is abnormally distant from other values. But when is a distance abnormal? In general, data scientists will use their expert judgment, rules of thumb, or a combination of both.

10. Expert judgment

Expert judgment could be used if the data scientist is considered an expert in the domain of credit risk modeling. They can then judge that an annual wage above, let's say, 3 million dollars must be an error and should be deleted from the data set.
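One way to apply such a threshold is sketched below on simulated data. The names index_highincome and loan_data_expert are assumptions for this example, and the 3 million cutoff is the illustrative value from the text.

```r
# Simulated incomes with one extreme value (hypothetical data).
set.seed(42)
loan_data <- data.frame(
  annual_inc = c(rlnorm(999, meanlog = 11, sdlog = 0.5), 6e6)
)

# Rows whose income exceeds the expert-chosen threshold.
index_highincome <- which(loan_data$annual_inc > 3e6)

# Guard against an empty index: loan_data[-integer(0), ] would drop every row.
if (length(index_highincome) > 0) {
  loan_data_expert <- loan_data[-index_highincome, , drop = FALSE]
} else {
  loan_data_expert <- loan_data
}

nrow(loan_data_expert)
```

The guard matters in practice: negative indexing with an empty vector returns an empty data frame rather than the original one.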

11. Rule of thumb

If a data scientist wants to rely on a rule of thumb, they could delete all values that lie more than 1.5 times the interquartile range above the third quartile or below the first quartile. The interquartile range is the distance between the first and third quartiles of the variable's distribution. As outliers in the negative range did not occur here, we only delete ones in the positive range.
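A common form of this rule sets the upper cutoff at the third quartile plus 1.5 times the interquartile range. The sketch below applies it to simulated incomes; the names outlier_cutoff and loan_data_ROT are assumptions for this example.

```r
# Simulated incomes with one extreme value (hypothetical data).
set.seed(42)
loan_data <- data.frame(
  annual_inc = c(rlnorm(999, meanlog = 11, sdlog = 0.5), 6e6)
)

# Upper cutoff: third quartile + 1.5 * interquartile range.
outlier_cutoff <- quantile(loan_data$annual_inc, 0.75) +
  1.5 * IQR(loan_data$annual_inc)

# Keep only the rows at or below the cutoff (positive-range outliers removed).
loan_data_ROT <- loan_data[loan_data$annual_inc <= outlier_cutoff, , drop = FALSE]

# Number of observations removed by the rule.
nrow(loan_data) - nrow(loan_data_ROT)
```

A lower cutoff would be built the same way, as the first quartile minus 1.5 times the interquartile range.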

12. Histograms

After deleting outliers, you get the following results. These histograms are more informative than the initial ones that included the outliers, especially the histogram that was constructed using the rule of thumb. Note that quite a few observations were deleted using this rule of thumb. Even if you don't plan to actually leave out these outliers in your analysis, it might still be useful to delete them temporarily when visualizing the data.
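Temporary removal for plotting can be sketched by filtering inside the hist() call, so that the underlying data stay untouched. The data are simulated, and the names keep and hist_trimmed are assumptions for this example.

```r
# Simulated incomes with one extreme value (hypothetical data).
set.seed(42)
annual_inc <- c(rlnorm(999, meanlog = 11, sdlog = 0.5), 6e6)

# Flag the values at or below the rule-of-thumb cutoff.
outlier_cutoff <- quantile(annual_inc, 0.75) + 1.5 * IQR(annual_inc)
keep <- annual_inc <= outlier_cutoff

# Plot only the flagged values; annual_inc itself is not modified.
hist_trimmed <- hist(annual_inc[keep],
                     breaks = sqrt(sum(keep)),
                     main = "Annual income without outliers",
                     xlab = "Annual income")

# The original vector still holds every observation.
length(annual_inc)
```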

13. Bivariate plot

Let's conclude by looking at a bivariate plot. When you include a second variable in the plot() function, the first argument will be plotted on the x-axis and the second argument on the y-axis. A bivariate plot for employment length and annual income is shown here. Having a look at bivariate plots can be interesting to track bivariate outliers, which are outliers on two dimensions of the data.

14. Bivariate plot

For the combination plotted here, we only see an outlier on the scale of annual income and not for employment length.
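A sketch of such a bivariate plot on simulated data follows. The column name emp_length and its range of 0 to 40 years are assumptions for this example, as are the simulated incomes.

```r
# Simulated employment lengths and incomes (hypothetical data).
set.seed(42)
loan_data <- data.frame(
  emp_length = sample(0:40, 1000, replace = TRUE),
  annual_inc = c(rlnorm(999, meanlog = 11, sdlog = 0.5), 6e6)
)

# First argument on the x-axis, second argument on the y-axis.
plot(loan_data$emp_length, loan_data$annual_inc,
     xlab = "Employment length",
     ylab = "Annual income")

# The extreme point is unusual in income only, not in employment length.
loan_data[which.max(loan_data$annual_inc), ]
```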

15. Let's practice!

Now let's try to make some plots and histograms ourselves.