1. Missing data and coarse classification
Now,
2. Outlier deleted
we have removed the observation containing a bivariate outlier for age and annual income from the dataset.
3. Missing inputs
What we did not discuss before is that there are missing inputs (or NAs, which stand for not available) for two variables: employment length and interest rate. In this video, we will demonstrate some methods for handling missing data on the employment length variable. You'll practice this newly gained knowledge yourself on the variable interest rate.
4. Missing inputs
First, you want to know how many inputs are missing, as this will affect what you do with them. A simple way of finding out is with the function summary(). If you do this for employment length, you will see that there are 809 NAs.
5. Missing inputs: strategies
There are generally three ways to treat missing inputs: delete them, replace them, or keep them. We will illustrate these methods on employment length. When deleting, you can either delete the observations where missing inputs are detected, or delete an entire variable. Typically, you would only want to delete observations if there is just a small number of missing inputs, and would only consider deleting an entire variable when many cases are missing.
6. Delete rows
Using this construction with which() and is-dot-na(), the rows with missing inputs are deleted in the new data set loan_data_no_NA.
7. Delete column
To delete the entire variable employment length, you simply set the employment length variable in the loan data equal to NULL. Here, we save the result to a copy of the data set called loan_data_delete_employ. Making a copy of your original data before deleting things can be a good way to avoid losing information, but may be costly if working with very large datasets.
8. Replace: median imputation
Second, when replacing a variable, common practice is to replace missing values
9. Replace: median imputation
with the median of the values that are actually observed. This is called median imputation.
10. Keep
Last, you can keep the missing values, since in some cases, the fact that a value is missing is important information. Unfortunately, keeping the NAs as such is not always possible, as some methods will automatically delete rows with NAs because they cannot deal with them. So how can we keep NAs? A popular solution is coarse classification.
Using this method, you basically put a continuous variable into so-called bins. Let's start off making a new variable emp_cat, which will be the variable replacing emp_length. The employment length in our data set ranges from 0 to 62 years. We will put employment length into bins of roughly 15 years, with groups 0 to 15, 15 to 30, 30 to 45, 45 plus, and a "missing” group, representing the NAs. Let's see how this changes our data.
11. Keep: coarse classification
12. Keep: coarse classification
13. Bin frequencies
Let's look at the plot of this new factor variable. It appears that the bin '0-15' contains a very high proportion of the cases, so it might seem more reasonable to look at bins of different ranges but with similar frequencies,
14. Bin frequencies
You can get these results by trial and error for different bin ranges, or by using quantile functions to know exactly where the breaks should be to get more balanced bins.
15. Final remarks
Before trying all of this in R yourself, let me finish the video with a couple of remarks. First, all the methods for missing data handling can also be applied to outliers. If you think an outlier is wrong, you can treat it as NA and use any of the methods we have discussed in this chapter.
16. Final remarks
Second, you may have noticed I only talked about missingness for continuous variables in this chapter. What about factor variables? Here's the basic approach. For categorical variables, deletion works in the exact same way as for continuous variables, deleting either observations or entire variables. When we wish to replace a missing factor variable, this is done by assigning it to the modal class, which is the class with the highest frequency. Keeping NAs for a categorical variable is done by including a missing category.
17. Let's practice!
Now, let's try some of these methods yourself!