1. Missing data and imputation
Welcome back! Hopefully, you are starting to become comfortable using Python with Power BI. In this lesson, we will move on to understanding how Python can be used to make identifying missing data and addressing it more efficient.
2. What is missing data?
Missing data is simply an occurrence of null, NA, 99, or blank value. These are some common ways to indicate that the column value for the ith row is missing.
3. What is missing data?
Using the previous fisheries dataset as an example, you can see the amount fished in Australia in 1989 and 1993 are both "null". This indicates a missing value for those years.
4. Why is data missing?
There can be missing data for a variety of reasons. Here are a couple of examples.
First, a participant did not answer the entire survey.
A participant dropped out of a study, providing only one set of measurements.
There was a glitch in the instrument used to obtain measurements.
Privacy laws restrict the use of some of the participants' data.
5. Is it missing at random?
There are two large categories of missing data - missing at random and missing not at random.
Missing at random is when missing data is unrelated to other values of other variables in the dataset. It simply does not have a value.
Looking at the example table, the missing values occur in all three cities.
6. Is it missing at random?
However, sometimes missing data is more systemic within the dataset, collection process, or sample. This is known as “Missing Not at Random” or Non-Ignorable. Therefore, you cannot ignore it without biasing the results.
In the example dataset, all of the missing data is from Seattle. Therefore, it may be the case some underlying reason is causing the missing data.
7. Is it missing at random?
Some examples of why this may occur are an instrument can’t detect a low reading, so it gives an error, or individuals with higher incomes are less likely to reveal yearly income on a survey.
8. How to address missing data?
To understand how to proceed with an analysis, it’s important to determine the cause of the missing data and whether it is missing at random or not.
If the data is missing not at random, it may be best to pause the analysis until a further understanding of missing data and possible continuation of data collection. Or clearly document limitations and be clear about how the data is interpreted.
If the data is simply missing at random, there are methods for addressing this. These are to delete the observations, add an indicator variable that is a one if missing and zero if not, or imputation.
9. Imputation
Imputation is the process of replacing a missing value with another. A common approach is to use the column mean, median, or mode. Typically this is only used when 5% or less of the data is missing.
You could also use the previous or next values in the column. Here, it is important to sort appropriately!
10. Imputation - Example
Going back to the previous example, we can use the median for each city to impute the missing values. This is a simple example, but it can become more complex to fit the dataset's characteristics.
11. Dataset
In this chapter, you will be working with the "purchases" dataset.
This dataset is a log of transactions from an online retailer. It contains columns for the unique identifier for the transaction, Invoice, which customer made the transaction, what was purchased, how much, and the price per unit.
12. Let's practice!
In the following exercises, you will continue to learn the difference between Power BI and python in looking for missing data. Then, you will use python to perform imputation to replace the missing values.
Let’s practice!