
Analyze the amount of missingness

1. Analyze the amount of missingness

You've done a great job at detecting missing values in the previous exercise. The next major step is to analyze the missing data. A basic first analysis is to find the total number and the percentage of missing values for each column in the dataset.

2. Load Air Quality dataset

We will use the air quality dataset, which contains sensor recordings of Ozone, Solar, Temperature and Wind. This is a time-series dataset. Let's load the dataset using 'pd.read_csv()' with the argument 'parse_dates' set to ['Date'] and 'index_col' set to 'Date', so that the dates are parsed and used as the index. Printing the head of the dataset shows that there are a few null values. So let's now analyze the missingness in the dataset!
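As a minimal sketch of this loading step, the snippet below reads a small inline CSV instead of the course file, since the real file name and contents are not shown in this transcript. The column names 'Solar', 'Wind' and 'Temp', the dates and all values are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for the air quality file (values and dates are invented).
csv_text = """Date,Ozone,Solar,Wind,Temp
1976-05-01,41.0,190.0,7.4,67
1976-05-02,36.0,118.0,8.0,72
1976-05-03,12.0,149.0,12.6,74
1976-05-04,18.0,313.0,11.5,62
1976-05-05,,,14.3,56
1976-05-06,28.0,,14.9,66
"""

# parse_dates takes a list of column names; index_col makes 'Date' the index.
airquality = pd.read_csv(
    io.StringIO(csv_text), parse_dates=["Date"], index_col="Date"
)

# Empty fields in the CSV show up as NaN in the printed head.
print(airquality.head())
```

With a real file, the first argument would be the file path rather than the 'io.StringIO' object.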

3. Nullity DataFrame

We can use either the '.isnull()' or the '.isna()' method on the DataFrame to obtain its nullity. The two methods are aliases and return identical results! They return a DataFrame of 'True' and 'False' values, where 'True' means missing and 'False' means not missing. We have stored this DataFrame as 'airquality_nullity'. It can be called the nullity DataFrame or the dummy DataFrame, but let's just stick to nullity DataFrame.
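The nullity DataFrame can be sketched on a tiny frame as follows. The data here is invented for illustration; only the name 'airquality_nullity' comes from the lesson.

```python
import numpy as np
import pandas as pd

# Small made-up frame standing in for the air quality data.
airquality = pd.DataFrame(
    {
        "Ozone": [41.0, np.nan, 12.0, np.nan],
        "Solar": [190.0, 118.0, np.nan, 313.0],
        "Wind": [7.4, 8.0, 12.6, 11.5],
    }
)

# True marks a missing value, False marks a present one.
airquality_nullity = airquality.isnull()

# .isna() produces the same nullity DataFrame as .isnull().
same = airquality_nullity.equals(airquality.isna())

print(airquality_nullity)
```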

4. Total missing values

We can then apply the method '.sum()' on 'airquality_nullity' to obtain the number of missing values in each column of the DataFrame. This works because summing treats 'True' as '1' and 'False' as '0'.
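A short sketch of the counting step, again on invented data. The names 'missing_counts' and 'total_missing' are illustrative, not from the lesson.

```python
import numpy as np
import pandas as pd

# Small made-up frame standing in for the air quality data.
airquality = pd.DataFrame(
    {
        "Ozone": [41.0, np.nan, 12.0, np.nan],
        "Solar": [190.0, 118.0, np.nan, 313.0],
        "Wind": [7.4, 8.0, 12.6, 11.5],
    }
)
airquality_nullity = airquality.isnull()

# Summing booleans counts the True values, giving one count per column.
missing_counts = airquality_nullity.sum()

# Summing the per-column counts gives the total for the whole DataFrame.
total_missing = int(missing_counts.sum())

print(missing_counts)
print(total_missing)
```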

5. Percentage of missingness

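Similarly, to find the percentage of missing values for each column, we can apply the method '.mean()' on 'airquality_nullity' and multiply the result by 100. The mean of a column of 'True' and 'False' values is the fraction of its values that are missing.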
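The percentage calculation can be sketched the same way. The data is invented and 'missing_percent' is an illustrative name.

```python
import numpy as np
import pandas as pd

# Small made-up frame standing in for the air quality data.
airquality = pd.DataFrame(
    {
        "Ozone": [41.0, np.nan, 12.0, np.nan],
        "Solar": [190.0, 118.0, np.nan, 313.0],
        "Wind": [7.4, 8.0, 12.6, 11.5],
    }
)
airquality_nullity = airquality.isnull()

# The mean of a boolean column is the fraction of True values;
# multiplying by 100 turns that fraction into a percentage.
missing_percent = airquality_nullity.mean() * 100

print(missing_percent)
```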

6. Nullity Bar

For a better understanding, let us now visualize the amount of missing values graphically using the 'missingno' package, a library that provides functions for graphical analysis of missing data. We import 'missingno' as 'msno'. We will use the function 'msno.bar()' on 'airquality' to visualize the completeness of each column of the DataFrame.

7. Nullity Matrix

Another very important graphical analysis is to visualize the locations of missing values in the dataset. This allows us to quickly spot patterns in the missing values. We can create such a plot using the function 'msno.matrix()'. The plot describes the nullity in the dataset and appears blank wherever there are missing values.

8. Nullity Matrix

The sparkline on the right summarizes the general shape of data completeness.

9. Nullity Matrix

It also points out the row with the minimum number of null values in the DataFrame.

10. Nullity Matrix

The total count of columns is shown at the bottom.

11. Nullity Matrix for time-series data

Since this is a time-series dataset, we can set the frequency to 'M', that is, a month, to obtain a nullity matrix ranging over time. This way we can clearly observe during which season there is a higher amount of missingness.

12. Nullity Matrix for time-series data

From this plot, we can observe that there are higher amounts of missing values in the month of June.
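The monthly pattern read off the plot can also be cross-checked numerically. The sketch below is not the 'missingno' plot itself; it swaps in a plain pandas 'groupby' on the month of the index to count missing values per month. The data, the dates and the name 'monthly_missing' are invented for illustration.

```python
import numpy as np
import pandas as pd

# Made-up daily readings from May to July with a DatetimeIndex.
dates = pd.date_range("1976-05-01", "1976-07-31", freq="D")
airquality = pd.DataFrame({"Ozone": 30.0, "Solar": 200.0}, index=dates)

# Blank out a hypothetical run of June Ozone readings and one May Solar reading.
airquality.loc["1976-06-05":"1976-06-14", "Ozone"] = np.nan
airquality.loc[pd.Timestamp("1976-05-10"), "Solar"] = np.nan

# Count missing values per calendar month (5 = May, 6 = June, 7 = July).
monthly_missing = airquality.isna().groupby(airquality.index.month).sum()

print(monthly_missing)
```

A month with a noticeably larger count here should correspond to the month with more blank rows in the nullity matrix.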

13. Fine tuning the matrix

We can further slice the DataFrame to the months of May through July to get a clearer view of the amount of missingness. Slicing is particularly helpful when analyzing large datasets.
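A minimal sketch of the slicing step, assuming the DataFrame has a sorted DatetimeIndex. The year 1976, the data and the name 'may_to_july' are invented for illustration.

```python
import numpy as np
import pandas as pd

# Made-up daily readings from April to September with a DatetimeIndex.
dates = pd.date_range("1976-04-01", "1976-09-30", freq="D")
airquality = pd.DataFrame({"Ozone": 30.0, "Solar": 200.0}, index=dates)
airquality.loc["1976-06-05":"1976-06-14", "Ozone"] = np.nan

# Partial-string slicing on a DatetimeIndex includes both end months in full.
may_to_july = airquality.loc["1976-05":"1976-07"]

print(may_to_july.index.min(), may_to_july.index.max())
print(may_to_july.isna().sum())
```

The sliced frame can then be passed to 'msno.matrix()' in place of the full DataFrame.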

14. Summary

To summarize, in this lesson we learned to analyze the amount of missingness both numerically, as counts and percentages, and graphically. We also learned to read the nullity matrix for regular as well as time-series datasets.

15. Let's practice!

It's now time to practice!
