1. Are the data missing at random?
Along with the overlapping racial and ethnic demographic categories, we also have missing data.
2. Missing data
This issue is pervasive in data science and can be difficult to deal with properly. In this chapter we'll introduce approaches to analyzing missing data and show how you can deal with missing data in bigmemory and iotools.
3. Types of Missing Data
Generally speaking, missing data falls into one of three categories: missing completely at random, missing at random, and missing not at random.
4. MCAR
When data are missing completely at random there is no way to predict where in the data set we'll see a missing value. In an analysis this can often be handled by simply dropping rows of a data set with missing values.
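As a minimal sketch of dropping rows with missing values, here is a base R example. The data frame df and its columns are made up for illustration and are not part of the course data.

```r
# Hypothetical toy data frame with a few missing values.
df <- data.frame(
  income = c(52, NA, 61, 48, NA),
  age    = c(34, 29, NA, 51, 45)
)

# complete.cases() returns TRUE for rows with no NA in any column.
keep <- complete.cases(df)
df_complete <- df[keep, ]

# na.omit() drops the same rows in a single step.
nrow(df_complete)
nrow(na.omit(df))
```

Dropping rows like this is only reasonable under the MCAR assumption; if missingness is related to other variables, the remaining rows may no longer be representative.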
5. MAR
When missingness is associated with other variables we call it missing at random. This name is a misnomer. We really mean that conditioned on some of the variables in the data set, the data are missing completely at random. To deal with MAR data we generally predict values for the missing data several times to create multiple data sets that capture the statistical structure of the relationships between the variables and then perform an analysis on the data sets. This procedure is called multiple imputation.
6. MNAR
The last category, missing not at random, is for the case where data are neither MAR nor MCAR. It is usually caused by deterministic relationships between missingness and other measurements.
7. Dealing with missing data in this course
A full analysis of missing data and of the strategies for dealing with it is beyond the scope of this course. There is no direct way to check whether the data are MCAR, so we are going to check whether the data are MAR. If we find no evidence that they are, we will assume that the data are missing completely at random.
8. A Quick Check for MAR
To check if your data are MAR, take each column with missingness and recode it as one if it is missing and zero otherwise. Then fit a logistic regression of this missingness indicator on each of the other variables, one at a time. A significant p-value indicates an association between the regressor and missingness, suggesting that your data are MAR. If none are significant, then it's plausible that the data are missing completely at random.
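The recoding and regression steps can be sketched as follows. This is an illustration on simulated data, not the course's mortgage data: the variables age and income, and the way missingness is tied to age, are assumptions made up so that the data are MAR by construction.

```r
set.seed(1)
n <- 1000

# Hypothetical variables: age is fully observed, income will have gaps.
age    <- rnorm(n, mean = 40, sd = 10)
income <- rnorm(n, mean = 50, sd = 15)

# Assumption for illustration: older people are more likely to have
# missing income, so missingness depends on an observed variable (MAR).
p_miss <- plogis(-2 + 0.08 * (age - 40))
income[runif(n) < p_miss] <- NA

# Recode the column: 1 if missing, 0 otherwise.
is_missing <- as.integer(is.na(income))

# Logistic regression of the missingness indicator on age.
fit <- glm(is_missing ~ age, family = binomial)

# Row 2 is the slope for age; column 4 is its p-value.
p_value <- summary(fit)$coefficients[2, 4]
p_value
```

Because missingness was built to depend on age, the p-value here should be very small. With real data you would repeat the regression for each of the other variables.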
Because you are testing multiple hypotheses, you will likely get some p-values that are small by chance. As a result, you may need to adjust your cutoff for significance based on how many regressions you perform.
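One common way to make this adjustment is a Bonferroni correction, which divides the cutoff by the number of tests. The course does not say which adjustment to use, so treat this as one possible choice. The p-values below are made-up numbers for illustration.

```r
# Hypothetical p-values from five separate regressions.
p_values <- c(0.004, 0.03, 0.20, 0.47, 0.81)
alpha <- 0.05

# Option 1: shrink the cutoff by the number of tests.
bonferroni_cutoff <- alpha / length(p_values)
significant_by_cutoff <- p_values < bonferroni_cutoff

# Option 2: inflate the p-values instead and keep the usual cutoff.
p_adj <- p.adjust(p_values, method = "bonferroni")
significant_by_adjust <- p_adj < alpha

significant_by_cutoff
significant_by_adjust
```

Both options flag the same regressions. p.adjust() also supports less conservative methods such as "holm" and "BH".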
9. MAR Quick Check Example
This slide shows how to look for MAR data using R. We create a binary is_missing variable with 1,000 elements indicating where values are missing. To generate 10 independent variables, we randomly sample them from a normal distribution and place them in a matrix with 1,000 rows.
10. MAR Quick Check Example
Then we fit a logistic regression of the is_missing variable on each column in the matrix, one column at a time, using R's glm() function. The p-value for the column is in the regression summary, in the coefficients matrix, at row 2, column 4.
Each of these p-values is stored and printed to the screen. In this case, all of the p-values are large, indicating that there is not a significant association between the regressors and missingness. We therefore have no evidence that the data are MAR. This should not come as a surprise though, since all of the data we used were generated at random.
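The slide code is not reproduced in this transcript, so the following is a sketch of what the description amounts to rather than the exact code shown. The seed, the 50/50 missingness rate, and the names x and p_vals are assumptions.

```r
set.seed(42)
n <- 1000

# Binary missingness indicator with 1,000 elements, generated at random.
is_missing <- rbinom(n, size = 1, prob = 0.5)

# 10 independent variables drawn from a normal distribution,
# stored as the columns of a matrix with 1,000 rows.
x <- matrix(rnorm(n * 10), nrow = n, ncol = 10)

# Fit one logistic regression per column and keep the slope's p-value.
p_vals <- rep(NA_real_, ncol(x))
for (j in seq_len(ncol(x))) {
  fit <- glm(is_missing ~ x[, j], family = binomial)
  # Row 2 is the slope for the column; column 4 is its p-value.
  p_vals[j] <- summary(fit)$coefficients[2, 4]
}

p_vals
```

Since is_missing and x are generated independently, most p-values should be large, though with 10 tests an occasional small one can still appear by chance.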
11. Let's practice!
Now you'll check to see if the mortgage data set has data missing at random.