1. Dealing with missing values (I)
Now that you can recognize why missing values occur and how to locate them, you need to know how they can be dealt with.
2. Listwise deletion
If you are confident that the missing values in your dataset occur at random (in other words, they are not being intentionally omitted), the most effective and statistically sound approach to dealing with them is called 'complete case analysis', or listwise deletion.
In this method, a record is fully excluded from your model if any of its values are missing.
Take for example the dataset shown here. Although most of the information is available in the first and third rows, because values in the ConvertedSalary column are missing, these rows will be dropped.
3. Listwise deletion in Python
To implement listwise deletion using pandas, you can use the dropna() method with the how argument set to 'any', which is also its default. This will delete all rows with at least one missing value.
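As a minimal sketch of this call, the example below builds a small made-up DataFrame (the column names and values are illustrative assumptions, not the course dataset) and keeps only the rows with no missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data; names and values are invented for illustration.
df = pd.DataFrame({
    "Country": ["France", "India", "Kenya", "Brazil"],
    "Gender": ["Male", None, "Female", "Female"],
    "ConvertedSalary": [np.nan, 52000.0, np.nan, 61000.0],
})

# Listwise deletion: drop every row that has at least one missing value.
complete_cases = df.dropna(how="any")

print(complete_cases)
```

dropna() returns a new DataFrame here, so the original df still holds all four rows.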
4. Listwise deletion in Python
On the other hand, if you want to delete rows with missing values in only a specific column, you can use the subset argument. Pass a list of columns to this argument to specify which columns to consider when deleting rows.
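A short sketch of the subset argument, again on a made-up DataFrame whose column names are assumptions for illustration: only missing values in ConvertedSalary cause a row to be dropped, so a row with a missing Gender survives.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data; names and values are invented for illustration.
df = pd.DataFrame({
    "Country": ["France", "India", "Kenya", "Brazil"],
    "Gender": ["Male", None, "Female", "Female"],
    "ConvertedSalary": [np.nan, 52000.0, np.nan, 61000.0],
})

# Only rows with a missing ConvertedSalary are removed;
# missing values in other columns are ignored.
salary_known = df.dropna(subset=["ConvertedSalary"])

print(salary_known)
```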
5. Issues with deletion
While listwise deletion is the preferable approach when missing data occurs purely at random, it does have its drawbacks.
First, it deletes perfectly valid data points that share a row with a missing value.
Second, if the missing values do not occur entirely at random, dropping those rows can bias the remaining data and negatively affect the model.
Lastly, if you were to remove a feature instead of a row, you would reduce the degrees of freedom of your model.
6. Replacing with strings
The most common way to deal with missing values is to simply fill them using the fillna() method. To use fillna() on a specific column, select that column and provide the value you want to replace the missing values with. In the case of categorical columns, it is common to replace missing values with strings like 'Other' or 'Not Given'. To replace the missing values in place, in other words to modify the original DataFrame, you need to set the inplace argument to True.
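The sketch below fills a categorical column with a placeholder string. The DataFrame and the 'Not Given' label are illustrative assumptions. It assigns the result back to the column rather than passing inplace=True; this is a choice made for the example, since recent pandas versions may warn when inplace is used on a column selected from a DataFrame.

```python
import pandas as pd

# Hypothetical survey data; names and values are invented for illustration.
df = pd.DataFrame({
    "Country": ["France", "India", "Kenya", "Brazil"],
    "Gender": ["Male", None, "Female", None],
})

# Replace missing categories with a placeholder string and
# store the result back in the original DataFrame.
df["Gender"] = df["Gender"].fillna("Not Given")

print(df["Gender"].value_counts())
```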
7. Recording missing values
In situations where you believe that the absence or presence of data is more important than the values themselves, you can create a new column that records the absence of data and then drop the original column.
To do this, call the notnull() method on a specific column. This returns a Series of True/False values, with True where a value is present and False where it is missing.
To drop columns from a DataFrame, you can use the drop() method and specify a list of column names which you want to drop as the columns argument.
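Putting the two steps together, a minimal sketch might look like the following. The DataFrame and the new column name SalaryGiven are hypothetical, chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data; names and values are invented for illustration.
df = pd.DataFrame({
    "Country": ["France", "India", "Kenya", "Brazil"],
    "ConvertedSalary": [np.nan, 52000.0, np.nan, 61000.0],
})

# Record whether a salary was provided (True) or missing (False).
df["SalaryGiven"] = df["ConvertedSalary"].notnull()

# Drop the original column now that its presence has been recorded.
df = df.drop(columns=["ConvertedSalary"])

print(df)
```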
8. Practice time
With this in mind, you will now work through applying listwise deletion and some alternatives for replacing missing values in categorical columns.