Addressing missing data

1. Addressing missing data

Hi, I'm George, and in this video we'll look at strategies for handling missing data.

2. Why is missing data a problem?

So, why is it important to deal with missing data? Well, it can affect distributions. As an example, we collect the heights of students at a high school. If we fail to collect the heights of the oldest students, who were taller than most of our sample, then our sample mean will be lower than the population mean. Put another way, our data is less representative of the underlying population. In this case, parts of our population aren't proportionately represented. This misrepresentation can lead us to draw incorrect conclusions, like thinking that, on average, students are shorter than they really are.

3. Data professionals' job data

Let's illustrate how missing data impacts exploratory analysis using a dataset about data professionals. This dataset includes the year the data was obtained, job title, experience level, type of employment, location, company size, time spent working remotely, and salary in US dollars.

4. Salary by experience level

To highlight the impact of missing values, let's look at salaries by experience level using a full version of the dataset. Now, let's compare this to the same data with some missing values. The y-axis shows that the largest salary is around 150000 dollars less in the second plot!

5. Checking for missing values

With our dataset stored as a pandas DataFrame called salaries, we can count the number of missing values per column by chaining the dot-isna and dot-sum methods. isna refers to the fact that missing values are represented as na in DataFrames. The output shows all columns contain missing values, with Salary_USD missing 60 values.

6. Strategies for addressing missing data

There are various approaches to handle missing data. One rule of thumb is to remove observations if they amount to five percent or less of all values. If we have more missing values, instead of dropping them, we can replace them with a summary statistic like the mean, median, or mode, depending on the context. This is known as imputation. Alternatively, we can impute by sub-groups. We saw that median salary varies by experience, so we could impute different salaries depending on experience.

7. Dropping missing values

To calculate our missing values threshold we multiply the length of our DataFrame by five percent, giving us an upper limit of 30.

8. Dropping missing values

We can use Boolean indexing to filter for columns with missing values less than or equal to this threshold, storing them as a variable called cols_to_drop. Printing cols_to_drop shows four columns. We drop missing values by calling dot-dropna, passing cols_to_drop to the subset argument. We set inplace to True so the DataFrame is updated.

9. Imputing a summary statistic

We then filter for the remaining columns with missing values, giving us four columns. To impute the mode for the first three columns, we loop through them and call the dot-fillna method, passing the respective column's mode and indexing the first item, which contains the mode, in square brackets.

10. Checking the remaining missing values

Checking for missing values again, we see salary_USD is now the only column with missing values and the volume has changed from 60 missing values to 41. This is because some rows may have contained missing values for our subset columns as well as salary, so they were dropped.

11. Imputing by sub-group

We'll impute median salary by experience level by grouping salaries by experience and calculating the median. We use the dot-to-dict method, storing the grouped data as a dictionary. Printing the dictionary returns the median salary for each experience level, with executives earning the big bucks!

12. Imputing by sub-group

We then impute using the dot-fillna method, providing the Experience column and calling the dot-map method, inside which we pass the salaries dictionary.

13. No more missing values!

Now we see there are no more missing values!

14. Let's practice!

Let's practice working with missing data!

Create Your Free Account

By continuing, you accept our Terms of Use, our Privacy Policy and that your data is stored in the USA.