
Handling missing values

1. Handling missing values

Out in the wild, data is rarely clean and can contain missing values.

2. Finding missing values

Missing values can appear for various reasons, most of which are out of our control, so it is our job to handle them properly. The first step is to find them. The describe function, which we already know, shows the number of missing values in each column. Here, we can see that the penguins dataset contains missing values in the island and body mass columns.
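As a minimal sketch of that first step, the snippet below builds a tiny stand-in for the penguins data and asks describe for the missing-value count only. The table contents and the column name body_mass_g are assumptions made for illustration, not the course dataset.

```julia
using DataFrames

# Toy stand-in for the penguins data; names and values are assumptions.
penguins = DataFrame(
    species = ["Adelie", "Adelie", "Gentoo", "Gentoo"],
    island = ["Torgersen", missing, "Biscoe", "Biscoe"],
    body_mass_g = [3750, 3800, missing, 5000],
)

# Passing :nmissing restricts the summary to the missing-value count per column.
missing_counts = describe(penguins, :nmissing)
println(missing_counts)
```

Calling describe(penguins) without the second argument returns the full summary, which includes the same nmissing column alongside the other statistics.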

3. ismissing()

If we want to extract rows with a missing value in a certain column, we can use the ismissing function. To print the rows where the island is missing, we subset the DataFrame by calling the ismissing function and broadcasting it over the island column. Note that the broadcasting is necessary: without the dot, ismissing tests the column as a whole and returns a single false instead of one value per row.
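The difference between the plain and the broadcast call can be sketched as follows. The toy table is an assumption made for illustration.

```julia
using DataFrames

# Toy stand-in for the penguins data; names and values are assumptions.
penguins = DataFrame(
    species = ["Adelie", "Adelie", "Gentoo"],
    island = ["Torgersen", missing, "Biscoe"],
    sex = ["male", "female", missing],
)

# Without the dot, ismissing tests the vector itself, which is not `missing`.
whole_column = ismissing(penguins.island)

# With the dot, it tests each element and returns a Boolean mask.
mask = ismissing.(penguins.island)

# Keep the rows where the mask is true; the colon keeps every column.
missing_island = penguins[mask, :]
println(missing_island)
```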

4. ismissing()

If we want to return just some columns, we can replace the colon with column names. Here we only return the species and sex columns.
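A short sketch of the same subset with a column list in place of the colon, again on an assumed toy table:

```julia
using DataFrames

# Toy stand-in for the penguins data; names and values are assumptions.
penguins = DataFrame(
    species = ["Adelie", "Adelie", "Gentoo"],
    island = ["Torgersen", missing, "Biscoe"],
    sex = ["male", "female", "female"],
)

# Rows where island is missing, restricted to the species and sex columns.
subset_cols = penguins[ismissing.(penguins.island), [:species, :sex]]
println(subset_cols)
```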

5. dropmissing()

Once we locate the missing values, we have several options for what to do with them. We can drop all the rows containing missing values by calling dropmissing and passing the DataFrame name. Or, if we only want to drop the rows where one particular column is missing and keep missing values elsewhere, we call dropmissing and pass the DataFrame name and the name of that column.
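Both variants can be sketched on a small assumed table with one missing island and one missing body mass:

```julia
using DataFrames

# Toy stand-in for the penguins data; names and values are assumptions.
penguins = DataFrame(
    species = ["Adelie", "Adelie", "Gentoo", "Gentoo"],
    island = ["Torgersen", missing, "Biscoe", "Biscoe"],
    body_mass_g = [3750, 3800, missing, 5000],
)

# Drop every row that has a missing value in any column.
complete_rows = dropmissing(penguins)

# Drop only the rows where island is missing; missing body masses are kept.
known_island = dropmissing(penguins, :island)

println(nrow(complete_rows), " complete rows, ", nrow(known_island), " rows with a known island")
```

dropmissing returns a new DataFrame and leaves penguins unchanged; dropmissing! is the in-place counterpart.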

6. replace()

We already know how to use the replace function to replace missing values in a column with a predefined value or with the column mean. Remember, we need to use the skipmissing function, as it allows us to calculate the mean while skipping missing values.
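As a quick reminder, here is a minimal sketch of both replacements. The toy table, the column name body_mass_g, and the placeholder "Unknown" are assumptions made for illustration.

```julia
using DataFrames, Statistics

# Toy stand-in for the penguins data; names and values are assumptions.
penguins = DataFrame(
    island = ["Torgersen", missing, "Biscoe"],
    body_mass_g = [3700, missing, 5000],
)

# Replace missing islands with a predefined value.
penguins.island = replace(penguins.island, missing => "Unknown")

# skipmissing lets mean ignore the missing entry; rounding keeps integers.
column_mean = round(Int, mean(skipmissing(penguins.body_mass_g)))
penguins.body_mass_g = replace(penguins.body_mass_g, missing => column_mean)

println(penguins)
```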

7. Replacing with grouped summary statistics

Now, we'll learn how to replace missing values using grouped summary statistics. Getting back to the penguins example, we would like to impute the missing body mass values. While we can calculate the mean of the whole column, it would make more sense to replace missing values for individual species with the mean or median weight for that species. Or even better, we could replace the missing weights based on the species and sex of the penguins.

8. Replacing using groupby()

We can do it the following way. We group penguins by species and iterate over the groups. For each group, we select the rows with missing values by broadcasting ismissing over the body mass column. As we only want to replace the values in the body mass column, we include it in the subset. We assign the mean of the group's non-missing body mass values as the new value. As the original values were integers, we round the mean. Lastly, we check the result using the describe function.
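The loop described above might look like the sketch below. It runs on an assumed toy table, the column name body_mass_g is an assumption, and the exact code on the slide may differ.

```julia
using DataFrames, Statistics

# Toy stand-in for the penguins data; names and values are assumptions.
penguins = DataFrame(
    species = ["Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo", "Gentoo"],
    body_mass_g = [3700, missing, 3900, 5000, 5200, missing],
)

for group in groupby(penguins, :species)
    # Mean of the group's non-missing values, rounded to stay integer-valued.
    group_mean = round(Int, mean(skipmissing(group.body_mass_g)))
    # Each group is a view of `penguins`, so this writes into the original table.
    group[ismissing.(group.body_mass_g), :body_mass_g] .= group_mean
end

# Check that no missing values remain.
println(describe(penguins, :nmissing))
```

Because the groups are views rather than copies, no separate step is needed to write the imputed values back into penguins.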

9. Replacing using multiple columns

We can use this process even if we group by both species and sex.
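A sketch of the same loop with two grouping columns, again on an assumed toy table. The only change is that groupby receives a vector of column names.

```julia
using DataFrames, Statistics

# Toy stand-in for the penguins data; names and values are assumptions.
penguins = DataFrame(
    species = ["Adelie", "Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo"],
    sex = ["male", "male", "female", "female", "male", "male"],
    body_mass_g = [4000, missing, 3400, missing, 5500, 5300],
)

# One group per (species, sex) combination present in the data.
for group in groupby(penguins, [:species, :sex])
    group_mean = round(Int, mean(skipmissing(group.body_mass_g)))
    group[ismissing.(group.body_mass_g), :body_mass_g] .= group_mean
end

println(penguins)
```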

10. Insufficient data

What happens if there are no values in the respective group, though? We get an argument error. So what to do then? Well, there are several options. Maybe it is because we divided the DataFrame into too many groups. In that case, we can group by fewer or different columns. If that is not the case, we need to be more creative. For example, we can fall back on a predefined value, or on the median, mean, or minimum of the whole column. What we choose depends a lot on the situation. Or, if you are familiar with try-catch statements, you can use those.
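One possible fallback is sketched below: check whether a group has any known values before computing its mean, and otherwise use the median of the whole column. This uses an explicit check rather than try-catch; the toy table, the column name body_mass_g, and the choice of the overall median are assumptions made for illustration.

```julia
using DataFrames, Statistics

# Toy stand-in; the Chinstrap group has no known body mass at all.
penguins = DataFrame(
    species = ["Adelie", "Adelie", "Chinstrap", "Gentoo", "Gentoo"],
    body_mass_g = [3700, 3900, missing, 5000, missing],
)

# Fallback computed once from the whole column before anything is replaced.
overall_median = round(Int, median(skipmissing(penguins.body_mass_g)))

for group in groupby(penguins, :species)
    # Collect the group's non-missing values; this is empty for Chinstrap.
    known = collect(skipmissing(group.body_mass_g))
    fill_value = isempty(known) ? overall_median : round(Int, mean(known))
    group[ismissing.(group.body_mass_g), :body_mass_g] .= fill_value
end

println(penguins)
```

Checking for an empty group up front keeps the choice of fallback visible in the code and does not depend on which error a particular summary function raises.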

11. Cheat sheet - find and drop missing values

Here is a cheat sheet summarizing what we've learned about finding and dropping missing values.

12. Cheat sheet - replace missing values

And here is a cheat sheet about replacing missing values.

13. Let's practice!

Are you ready to say goodbye to missing values? Let's head to the exercises to practice!