1. Missing data: what can go wrong
Hello and welcome to the course. My name is Michał Oleszak and I will be your instructor for Handling Missing Data with Imputations in R.
2. What you will learn
This course will teach you how to correctly handle incomplete datasets. Upon finishing, you will understand why missing data require special treatment. You will be able to use statistical tests and visualization tools to detect patterns in missing data, as well as perform imputation with a collection of statistical and machine learning models. Finally, you will learn how to incorporate uncertainty from imputation into your analyses and predictions, making them more robust.
3. Prerequisites
Before we start, please make sure you are familiar with the following topics. I will assume you can do basic data manipulations with dplyr and the pipe operator. You should also be able to fit and interpret the results of linear and logistic regression models. Finally, some basic probability knowledge might come in handy. Without further ado, let's get started!
4. Missing data primer
As it was famously pointed out: "the best way to treat missing data is not to have them". It's always better to have data than not. Your data contain only as much information as there are observed values. However cleverly you treat the missing ones, the result will always be inferior to the ideal scenario of having no missing data in the first place. That said, we live in an imperfect world with plenty of missing data everywhere: it arises from nonresponse in surveys, technical issues with data-collecting equipment, joining data from different sources, some of which might lack some records, and many other causes. While analyzing possibly incomplete data, we have to stay watchful. Let's illustrate why with an example.
5. NHANES data
NHANES stands for "National Health and Nutrition Examination Survey". Here, we are looking at a subset of these data containing some physical measurements of the respondents. We can check the number of missing entries in each variable by feeding the data frame to the "is.na" function with the pipe operator, and then feeding the result to "colSums". As we can see, except age and gender, all other variables are not fully observed. Let's try predicting whether a person has been diagnosed with diabetes.
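The counting step described above can be sketched as follows. Since the NHANES subset itself isn't reproduced here, this is a minimal sketch on a small toy data frame with the same kinds of variables; the column names and values are illustrative assumptions, not the actual course data.

```r
library(dplyr)

# Toy stand-in for the NHANES subset (illustrative values only)
toy <- data.frame(
  age         = c(34, 51, 28, 45),
  weight      = c(70, NA, 82, 91),
  cholesterol = c(NA, 5.2, NA, 6.1)
)

# Pipe the data frame into is.na(), then count TRUEs per column
toy %>% is.na() %>% colSums()
#         age      weight cholesterol
#           0           1           2
```

The same result comes from `colSums(is.na(toy))`; the piped form simply reads left to right, matching the narration.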
6. Linear regression with incomplete data
Let's start with a simple model predicting diabetes using age and weight. The model fit without any warnings, but if we dig deeper into the model summary, we will find out that 10 observations were deleted due to missing values. What if the deleted observations were somehow different from the rest? That would introduce bias into the results. But okay, let's continue and add the cholesterol level to the explanatory variables. Now we have 95 observations dropped. What if these are the people whose cholesterol was so high that it maxed out the measuring device, and so it was not recorded? We would certainly not want to ignore these cases. And then, which of the two models is better? You may be tempted to look at the adjusted R-squared to answer this. It cannot be used, however, because the two models were trained on two different data samples: a different number of observations were removed from each.
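The silent row-dropping described above is easy to demonstrate. This is a minimal sketch on simulated data (the variable names `age`, `weight`, `cholesterol`, and `diabetes` mirror the course example, but the data here are synthetic): R's default `na.action = na.omit` removes incomplete rows before fitting, so adding a partially observed predictor shrinks the sample without any warning.

```r
set.seed(1)
n <- 100

# Simulated stand-in for the NHANES subset
df <- data.frame(
  age         = rnorm(n, 50, 10),
  weight      = rnorm(n, 80, 15),
  cholesterol = rnorm(n, 5, 1),
  diabetes    = rbinom(n, 1, 0.2)
)
df$cholesterol[sample(n, 10)] <- NA  # 10 missing cholesterol values

# Model without the incomplete predictor: all rows are used
m1 <- glm(diabetes ~ age + weight, data = df, family = binomial)

# Adding cholesterol silently drops the 10 incomplete rows
m2 <- glm(diabetes ~ age + weight + cholesterol, data = df, family = binomial)

nobs(m1)  # 100
nobs(m2)  # 90 -- rows with missing cholesterol were dropped without warning
```

Because `m1` and `m2` were fit on different subsets of the data, their fit statistics are not directly comparable, which is exactly the problem described above.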
7. Main takeaways
Let's wrap up this video. Missing data are sometimes silently ignored by the software. Consequently, it can be impossible to compare different models, and the results obtained from simply dropping incomplete observations can be biased. If present, missing data need special treatment. Throughout this course, you will learn how to handle missing data correctly.
8. Let's practice!
Let's practice what you've learned so far.