Handling missing values

1. Handling missing values

In the previous lesson, you were introduced to the two null value types that you encounter in Python. In this lesson, you will assign null values to the missing values in the dataset!

2. Missing values

Missing values in a dataset usually aren't left unfilled; they are filled with dummy values such as 'NA', '-', or '.'. In this lesson, you will learn to detect such missing values and replace them with 'NaN'.

3. Detect missing values in College dataset

Let's use the 'college' dataset, which contains various details of college students, as an example. We'll load 'college.csv' using 'pd.read_csv()'. The first step in analyzing a dataset is to read it and print a snippet, so we'll print the head of the 'college' DataFrame. At first glance, the columns appear to hold float values. If you look closely, though, you can see that a few data points are filled with a period! This suggests that missing values might be represented by a period.

4. Detect missing values in College dataset

However, we can confirm this only through further analysis. We'll use the 'info()' method to get an overview of the dataset. Hey, something's odd here! All the columns except 'private' are of 'object' type, although they are supposed to be float. We can explore further and confirm by finding the unique values in one of the columns. This way we can find any non-numerical values!
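To make this check concrete, here is a minimal sketch. The real 'college.csv' isn't shown in this lesson, so the small inline CSV below (including the 'expense' column and all the values) is a made-up stand-in; only the use of '.' for missing entries follows the lesson.

```python
import io

import pandas as pd

# Hypothetical stand-in for college.csv; the real columns and values may differ.
csv_text = """private,csat,expense
1,972,10.3
0,.,7.1
1,1045,.
"""

college = pd.read_csv(io.StringIO(csv_text))

# info() lists each column with its dtype and non-null count.
college.info()

# Columns containing '.' could not be parsed as numbers, so they are not numeric.
print(college.dtypes)
```

A column that should be numeric but shows up with a text dtype is a hint that some non-numeric placeholder is hiding in it.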

5. Detect missing values in College dataset

Let's apply the '.unique()' method to the 'csat' column and sort the result using 'np.sort()'. From the output, you can clearly see that '.' is the only non-numerical value present. Hence, we need to replace it with 'NaN'.
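As a rough sketch of that step, assuming a tiny made-up 'csat' column rather than the real data:

```python
import io

import numpy as np
import pandas as pd

# Hypothetical 'csat' values; '.' marks the missing entries, as in the lesson.
csv_text = "csat\n972\n.\n1045\n.\n880\n"
college = pd.read_csv(io.StringIO(csv_text))

# Because of the '.', every value in the column was read as text.
csat_unique = college["csat"].unique()

# Sorting text puts '.' ahead of the digit strings, so it is easy to spot.
sorted_values = np.sort(csat_unique)
print(sorted_values)
```

Note that the values are compared as text here, not as numbers, which is why the placeholder ends up at the front.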

6. Replace missing values in College dataset

This can be simply achieved while loading the dataset to a DataFrame. You can use the argument 'na_values' in 'pd.read_csv' to specify the values for missing data.
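Here is a minimal sketch of the 'na_values' argument, again using a made-up inline CSV in place of the real 'college.csv':

```python
import io

import pandas as pd

# Hypothetical stand-in for college.csv; the real columns and values may differ.
csv_text = """private,csat,expense
1,972,10.3
0,.,7.1
1,1045,.
"""

# Treat every '.' as a missing value while parsing.
college = pd.read_csv(io.StringIO(csv_text), na_values=".")

# The affected columns are now numeric, with NaN where '.' used to be.
print(college.dtypes)
print(college.isna().sum())
```

'na_values' also accepts a list, which helps when a file mixes several placeholders such as 'NA' and '-'.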

7. Replace missing values in College dataset

If you again check the 'info()' of 'college', you'll find that all the columns are now 'float64' type. This is great! Now, let's consider another dataset to detect hidden missing values.

8. Detect missing values in Diabetes dataset

We will use the Pima Indians Diabetes dataset, which contains various clinical diagnostic information about patients from the Pima community. When you load the dataset and print the head of the DataFrame, you can see 'NaN' values for missing data.

9. Detect missing values in Diabetes dataset

As before, let's print the 'info()' of the 'diabetes' DataFrame. The columns are all 'float' or 'int' type, as expected.

10. Detect missing values in Diabetes dataset

Further, we can analyze using the 'describe()' method on the 'diabetes' DataFrame. Observe closely. Something very odd here is that the 'BMI' column has a minimum value of 0. But we are aware that BMI cannot be 0. Hence, the 0's must rather be missing values in disguise!
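The sketch below shows the idea on a few made-up rows. The column names other than 'BMI' and all the values are assumptions for illustration, not the real diabetes data.

```python
import io

import pandas as pd

# Hypothetical rows standing in for the diabetes dataset.
csv_text = """Glucose,BMI,Age
148,33.6,50
85,0,31
183,23.3,32
89,28.1,21
137,0,33
"""

diabetes = pd.read_csv(io.StringIO(csv_text))

# describe() reports count, mean, std, min, quartiles and max per numeric column.
summary = diabetes.describe()

# A minimum of 0 for BMI is not physically plausible, so it deserves a closer look.
print(summary.loc["min"])
```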

11. Detect missing values in Diabetes dataset

To confirm the same, we can filter all the rows where 'BMI' is 0. There are 11 rows which have BMI as 0. They must be missing values. These types of missing values can be tricky as they require some level of domain knowledge.
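A boolean mask is one way to do this filtering. The small DataFrame below is made up, so it contains 2 zero rows rather than the 11 in the real dataset.

```python
import pandas as pd

# Hypothetical rows standing in for the diabetes dataset.
diabetes = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137],
    "BMI": [33.6, 0.0, 23.3, 28.1, 0.0],
})

# Keep only the rows where BMI is exactly 0.
zero_bmi = diabetes[diabetes["BMI"] == 0]

print(zero_bmi)
print(len(zero_bmi))
```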

12. Replace missing values with NaN

We'll replace these 11 zeros in the 'BMI' column with 'NaN' and check again using 'np.isnan()' on 'diabetes.BMI'. Great! Now that we have successfully removed the hidden missing values and replaced them with 'NaN's, let's summarize what we learned in this lesson!
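One possible way to write the replacement is with '.loc' and a boolean mask, as sketched here on made-up data. The lesson doesn't show the exact code it uses, so treat this as an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical rows standing in for the diabetes dataset.
diabetes = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137],
    "BMI": [33.6, 0.0, 23.3, 28.1, 0.0],
})

# Overwrite the zeros in BMI with NaN, leaving the other columns untouched.
diabetes.loc[diabetes["BMI"] == 0, "BMI"] = np.nan

# np.isnan() returns True wherever BMI is now missing.
missing_mask = np.isnan(diabetes["BMI"])
print(diabetes[missing_mask])
print(missing_mask.sum())
```

Restricting the assignment to the 'BMI' column matters, because a 0 can be a perfectly valid value in other columns.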

13. Summary

We learned to detect missing value characters like '.', detect the inherent missing values within the data like '0', and replace them with NaNs. In the next lesson, you'll dig deeper into analyzing the missing values.

14. Let's practice!

But it's now time to practice!
