Analyze the amount of missingness
1. Analyze the amount of missingness
You did a great job detecting missing values in the previous exercise. The next major step is to analyze the missing data. A basic analysis is to find the total number and percentage of missing values for each column in the dataset.
2. Load Air Quality dataset
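The loading step for the air quality data can be sketched as follows. This is a minimal illustration, not the lesson's exact code: the CSV text, the dates, and the column names are made-up stand-ins for the real file, which is read here from an in-memory string so the sketch runs on its own.

```python
import io

import pandas as pd

# Hypothetical stand-in for the air quality CSV file (all values are made up).
csv_text = """Date,Ozone,Solar,Wind,Temp
2020-05-01,41.0,190.0,7.4,67
2020-05-02,36.0,,8.0,72
2020-05-03,,149.0,12.6,74
"""

# parse_dates takes a list of column names to convert to datetimes;
# index_col makes the parsed 'Date' column the index of the DataFrame.
airquality = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["Date"],
    index_col="Date",
)

# The empty CSV fields show up as NaN in the printed head.
print(airquality.head())
```

With a real file, the `io.StringIO(...)` argument would simply be replaced by the path to the CSV.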
We will use the air quality dataset, which contains sensor recordings of Ozone, Solar, Temperature and Wind. This is a time-series dataset. Let's load it using "pd.read_csv()" with the argument "parse_dates" set to the 'Date' column and "index_col" also set to 'Date'. Printing the head of the dataset shows that there are a few null values. So let's now analyze the missingness in the dataset!
3. Nullity DataFrame
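A minimal sketch of building a nullity DataFrame, using a tiny made-up frame in place of the real air quality data (the values and dates are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

# Small hypothetical stand-in for the air quality DataFrame.
airquality = pd.DataFrame(
    {
        "Ozone": [41.0, np.nan, 12.0],
        "Solar": [190.0, 118.0, np.nan],
    },
    index=pd.to_datetime(["2020-05-01", "2020-05-02", "2020-05-03"]),
)

# True marks a missing value, False marks an observed one.
airquality_nullity = airquality.isnull()

# .isna() produces exactly the same result as .isnull().
same = airquality_nullity.equals(airquality.isna())

print(airquality_nullity)
print("isnull() and isna() agree:", same)
```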
We can use either the '.isnull()' or the '.isna()' method on the DataFrame to obtain its nullity. The two methods are equivalent! Both return a DataFrame of 'True' and 'False' values, where 'True' means missing and 'False' means not missing. We have stored this DataFrame as 'airquality_nullity'. It can be called the nullity DataFrame or the dummy DataFrame, but let's just stick to nullity DataFrame.
4. Total missing values
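Counting the missing values per column can be sketched like this, again on a small made-up frame rather than the real dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical data: Ozone has 2 missing values, Solar has 1, Wind has none.
airquality = pd.DataFrame(
    {
        "Ozone": [41.0, np.nan, np.nan, 18.0],
        "Solar": [190.0, 118.0, np.nan, 313.0],
        "Wind": [7.4, 8.0, 12.6, 11.5],
    }
)

airquality_nullity = airquality.isnull()

# Summing booleans counts the True values, because True is 1 and False is 0.
missing_counts = airquality_nullity.sum()
print(missing_counts)
```

Calling `.sum()` once more on `missing_counts` would give the total number of missing values in the whole DataFrame.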
We can then apply the '.sum()' method on 'airquality_nullity' to obtain the number of missing values in each column. This works because summing treats 'True' as '1' and 'False' as '0'.
5. Percentage of missingness
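The percentage calculation can be sketched in the same way. The frame below is made up for illustration; the mean of a boolean column is the fraction of 'True' values, so multiplying by 100 gives a percentage.

```python
import numpy as np
import pandas as pd

# Hypothetical data with 4 rows: Ozone is missing twice, Solar once, Wind never.
airquality = pd.DataFrame(
    {
        "Ozone": [41.0, np.nan, np.nan, 18.0],
        "Solar": [190.0, 118.0, np.nan, 313.0],
        "Wind": [7.4, 8.0, 12.6, 11.5],
    }
)

airquality_nullity = airquality.isnull()

# Fraction of missing values per column, scaled to a percentage.
missing_percent = airquality_nullity.mean() * 100
print(missing_percent)
```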
Similarly, to find the percentage of missing values in each column, we can apply the '.mean()' method to 'airquality_nullity' and multiply the result by 100.
6. Nullity Bar
For a better understanding, let us now visualize the amount of missing values graphically using the 'missingno' package, a library that provides functions for graphical analysis of missing data. We import 'missingno' as 'msno'. We will use the function "msno.bar()" on "airquality" to visualize the completeness of the DataFrame.
7. Nullity Matrix
Another very important graphical analysis is to visualize the locations of missing values in the dataset. This allows us to quickly spot patterns in the missing values. We can create such a plot using the function 'msno.matrix()'. The plot describes the nullity in the dataset and appears blank wherever there are missing values.
The sparkline on the right summarizes the general shape of data completeness and points out the row with the minimum number of null values in the DataFrame, as well as the total count of columns at the bottom.
11. Nullity Matrix for time-series data
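As a numeric counterpart to a monthly nullity matrix, the missing values can be counted per calendar month with plain pandas. This is a sketch on made-up data, not the lesson's plot: it groups the nullity DataFrame by the month number of the date index.

```python
import numpy as np
import pandas as pd

# Hypothetical time-indexed data covering May, June and July.
dates = pd.to_datetime(
    ["2020-05-10", "2020-05-20", "2020-06-05", "2020-06-15", "2020-06-25", "2020-07-10"]
)
airquality = pd.DataFrame(
    {
        "Ozone": [41.0, 36.0, np.nan, np.nan, 29.0, 97.0],
        "Solar": [190.0, np.nan, 286.0, np.nan, 127.0, 267.0],
    },
    index=dates,
)

# Count missing values per column for each calendar month (5 = May, 6 = June, ...).
monthly_missing = airquality.isnull().groupby(airquality.index.month).sum()
print(monthly_missing)

# Month with the largest total number of missing values across all columns.
worst_month = monthly_missing.sum(axis=1).idxmax()
print("Month with most missing values:", worst_month)
```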
Since this is a time-series dataset, we can set the frequency to 'M', that is, a month, to obtain a nullity matrix ranging over time. This way we can clearly observe in which season there is more missingness. From this plot, we can observe that there are more missing values in the month of June.
13. Fine tuning the matrix
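Slicing a date-indexed DataFrame to a range of months can be sketched as follows. The dates and values are made up; the slice relies on pandas partial string indexing, which includes every row from the start of the first month to the end of the last.

```python
import numpy as np
import pandas as pd

# Hypothetical daily data from April through August.
dates = pd.date_range("2020-04-01", "2020-08-31", freq="D")
airquality = pd.DataFrame({"Ozone": np.arange(len(dates), dtype=float)}, index=dates)

# Introduce missing values: five days in June and one day in August.
airquality.loc["2020-06-10":"2020-06-14", "Ozone"] = np.nan
airquality.loc["2020-08-01", "Ozone"] = np.nan

# Keep only May through July; both end months are included in full.
subset = airquality.loc["2020-05":"2020-07"]

# Missing values inside the slice only (the August gap is excluded).
subset_missing = subset.isnull().sum()
print(len(subset), "rows,", int(subset_missing["Ozone"]), "missing Ozone values")
```

The resulting `subset` could then be passed to 'msno.matrix()' in place of the full DataFrame.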
We can further slice the DataFrame between the months of May and July to get a clearer view of the missingness. Slicing is particularly helpful when analyzing large datasets.
14. Summary
To summarize this lesson, we learned to analyze the amount of missingness both numerically and graphically. We computed the total number and the percentage of missing values. Finally, we learned to analyze the nullity matrix for regular as well as time-series datasets.
15. Let's practice!
It's now time to practice!