1. Why do missing values exist?
In the first chapter, we looked at the different types of data you may encounter during analysis. In this lesson, we will explore the concept of messy and missing values, how to find them, and once identified, how to deal with them.
2. How gaps in data occur
While in an ideal world every dataset you come across would be perfectly complete and contain no gaps, unfortunately, this is rarely the case.
Real-world data often has noise or omissions. This can stem from many sources, for example:
Data not being collected properly (paper surveys not being filled out fully).
Collection and management errors (someone transcribing the data making a mistake).
Data intentionally being omitted (people may want to skip the age box in an online form).
Or gaps could be created due to transformations of the data (average of a field with missing data).
This list is far from comprehensive.
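To make the last source concrete, here is a minimal sketch of a transformation that produces a gap. The `ages` Series and its values are invented for illustration. pandas skips missing values when averaging by default, so the gap only appears when `skipna=False` is passed (or when a tool that does not skip them is used).

```python
import pandas as pd

# Hypothetical column of ages; the second respondent left the box empty.
ages = pd.Series([25, None, 40])

# Default behaviour: missing entries are skipped, so an average is returned.
default_avg = ages.mean()

# With skipna=False, a single missing entry makes the whole average missing.
strict_avg = ages.mean(skipna=False)

print(default_avg, strict_avg)
```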
3. Why do we care?
You may wonder why we are discussing this. Does missing data even matter? Yes, it does, and it is extremely important to identify and deal with missing data.
Many machine learning models cannot work with missing values. For example, if you were performing linear regression, you would need a value for every row and column used in your dataset.
Missing data may indicate a problem in your data pipeline. If data is consistently missing in a certain column, you should investigate why this is the case.
Missing data may provide information in itself. For example, if a person's number of children is missing, they may have no children.
4. Missing value discovery
You can use the info() method to have a preliminary look at how complete the dataset is.
Right from the get-go, you can see that the StackOverflowJobsRecommend, Gender, and RawSalary columns are highly underpopulated, and we should examine where these missing values occur. This output is useful but becomes limited with larger datasets that have missing values scattered across their features.
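As a rough sketch of what that call looks like, the snippet below builds a tiny stand-in table and runs info() on it. The name `survey_df` and all values are hypothetical, not the actual survey data. Columns whose non-null count is lower than the number of rows contain missing values.

```python
import pandas as pd

# Hypothetical stand-in for the survey data; the values are invented.
survey_df = pd.DataFrame({
    "Age": [21, 38, 45, 46],
    "Gender": ["Male", None, "Female", None],
    "RawSalary": [None, "70,841.00", None, "21,426.00"],
})

# info() prints each column with its non-null count and dtype.
survey_df.info()

# count() returns the same non-null counts as a Series, handy for sorting.
non_null_counts = survey_df.count()
print(non_null_counts)
```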
5. Finding missing values
To find where these missing values exist, you can use the isnull() method as shown here. All cells where missing values exist are shown as True.
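A minimal sketch of this on a small, made-up table (the name `survey_df` and its values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical stand-in for the survey data; the values are invented.
survey_df = pd.DataFrame({
    "Age": [21, 38, 45, 46],
    "Gender": ["Male", None, "Female", None],
})

# isnull() returns a DataFrame of the same shape, True where a value is missing.
missing_mask = survey_df.isnull()
print(missing_mask)
```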
6. Finding missing values
You can also count the number of missing values in a specific column by chaining the isnull() and sum() methods as shown here.
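For instance, on a small, made-up table (again, `survey_df` and its values are hypothetical), the chained call might look like this. It works because True counts as 1 and False as 0 when summed.

```python
import pandas as pd

# Hypothetical stand-in for the survey data; the values are invented.
survey_df = pd.DataFrame({
    "Age": [21, 38, 45, 46],
    "Gender": ["Male", None, "Female", None],
})

# Count the missing values in a single column.
gender_missing = survey_df["Gender"].isnull().sum()
print(gender_missing)
```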
7. Finding non-missing values
The inverse (the non-missing values) can also be found using the notnull() method. Here, all missing values are shown as False.
Note that you can call the isnull() and notnull() methods on both the DataFrame as a whole and on each of its individual columns.
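The sketch below shows both forms on a small, made-up table (`survey_df` and its values are hypothetical). The column-level result can also be used as a filter to keep only rows where that column is populated.

```python
import pandas as pd

# Hypothetical stand-in for the survey data; the values are invented.
survey_df = pd.DataFrame({
    "Age": [21, 38, 45, 46],
    "Gender": ["Male", None, "Female", None],
})

# On the whole DataFrame: True wherever a value is present.
present_mask = survey_df.notnull()

# On a single column: a boolean Series for that column only.
gender_present = survey_df["Gender"].notnull()

# The Series can be used to keep only the rows with a Gender value.
rows_with_gender = survey_df[gender_present]
print(rows_with_gender)
```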
8. Go ahead and find missing values!
It's time for you to find missing values in the Stack Overflow data!