
Missing data

1. Missing data

We've considered some feature types and how to create new features for them. In this lesson, we will deal with missing data.

2. Missing data

Some machine learning algorithms, like XGBoost or LightGBM, can handle missing data without any preprocessing. However, it's always a good idea to implement your own missing value imputation in order to improve the model. For example, consider the data presented on the slide. Let's assume that we need to solve a binary classification problem with labels 0 and 1. We have one categorical feature and one numerical feature. We'll consider how to fill in the gaps in this data. For example, observations with IDs 4 and 5 have missing values. Note that they are denoted as 'NaN' values in pandas DataFrames.
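For reference, a minimal sketch of such a table in pandas could look like this. The column names 'cat_feature' and 'num_feature' and the exact values are illustrative, not the actual slide data:

    import numpy as np
    import pandas as pd

    # Toy binary classification data: one categorical and one numerical feature,
    # with missing values in observations 4 and 5
    df = pd.DataFrame({
        'id': [1, 2, 3, 4, 5],
        'cat_feature': ['A', 'B', 'A', np.nan, 'A'],
        'num_feature': [3.5, 5.1, 4.8, 5.5, np.nan],
        'label': [0, 1, 0, 1, 0]
    })
    print(df)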

3. Impute missing data

For numerical features, the simplest method is mean or median imputation. It means that we fill each missing value with the mean or median of the available observations.
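As a quick sketch, mean imputation can also be done directly in pandas. The column name 'num_feature' is illustrative:

    # Fill missing numerical values with the mean of the available observations
    mean_value = df['num_feature'].mean()
    df['num_feature'] = df['num_feature'].fillna(mean_value)

    # Median imputation works the same way:
    # df['num_feature'] = df['num_feature'].fillna(df['num_feature'].median())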

4. Impute missing data

In this example, we would change the missing value to 4.72. However, imputation with mean or median just assigns an average observation to the missing value. So, we lose the information that this value was actually missing. To emphasize that the data was missing, special constant values are sometimes used.

5. Impute missing data

For example, -999. It's not a good choice for linear models but works perfectly for tree-based models.
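A minimal sketch of constant imputation in pandas, again assuming the illustrative column name 'num_feature':

    # Replace missing numerical values with a special constant
    # to preserve the information that the value was missing
    df['num_feature'] = df['num_feature'].fillna(-999)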

6. Impute missing data

To impute categorical features, we again have two choices: either fill in the most frequent category in the data,

7. Impute missing data

in this example it would be category A, or create a new category for the missing values. This again lets the model know that the observation had a missing value.

8. Impute missing data

For example, create a new category 'MISS' and fill in the missing value.
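Both categorical strategies can be sketched in pandas as follows; the column name 'cat_feature' is illustrative:

    # Option 1: fill with the most frequent category
    most_frequent = df['cat_feature'].mode()[0]
    df['cat_feature'] = df['cat_feature'].fillna(most_frequent)

    # Option 2: fill with a new dedicated category
    # df['cat_feature'] = df['cat_feature'].fillna('MISS')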

9. Find missing data

Let df be the pandas DataFrame from the example table presented on the previous slides. The pandas .isnull() method returns a DataFrame with Boolean cell values. If a value is missing, it returns True. If a value is present, it returns False. Therefore, we could call the .sum() method on this DataFrame and obtain the number of missing values in each column. In this case, we have one missing categorical feature and one missing numerical feature.
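In code, this check looks roughly as follows (column names depend on the actual DataFrame):

    # Count the number of missing values in each column
    print(df.isnull().sum())

    # For the example table, this reports one missing value in the
    # categorical column and one in the numerical column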

10. Numerical missing data

Let's now consider the Python implementation. Again we will use the scikit-learn package. Import SimpleImputer from the impute module. To impute numerical data, we create an object of this class. For mean imputation, we set the strategy parameter to 'mean'. For constant imputation, we set the strategy to 'constant' and specify the filling value (in this example, -999). Finally, we impute the values by applying the fit_transform() method to the selected columns. Note that we could impute multiple columns simultaneously by passing a list of columns, and that even if we want to impute a single column, we have to use double brackets.
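A sketch of the described workflow, assuming the illustrative column name 'num_feature':

    from sklearn.impute import SimpleImputer

    # Mean imputation
    mean_imputer = SimpleImputer(strategy='mean')
    df[['num_feature']] = mean_imputer.fit_transform(df[['num_feature']])

    # Constant imputation with the special value -999
    constant_imputer = SimpleImputer(strategy='constant', fill_value=-999)
    df[['num_feature']] = constant_imputer.fit_transform(df[['num_feature']])

Note the double brackets around 'num_feature': fit_transform() expects a 2D input, so even a single column is passed as a list.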

11. Categorical missing data

Imputation of categorical missing data is very similar. We again could use two different strategies: the most frequent category or a constant category, for example, category 'MISS' for the missing data. Then we apply the selected imputer to the list of columns we'd like to impute.
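The same pattern applies, sketched here with the illustrative column name 'cat_feature':

    from sklearn.impute import SimpleImputer

    # Most frequent category imputation
    frequent_imputer = SimpleImputer(strategy='most_frequent')
    df[['cat_feature']] = frequent_imputer.fit_transform(df[['cat_feature']])

    # Constant category imputation with a new 'MISS' category
    constant_imputer = SimpleImputer(strategy='constant', fill_value='MISS')
    df[['cat_feature']] = constant_imputer.fit_transform(df[['cat_feature']])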

12. Let's practice!

So, now you know the approaches to impute missing data. Let's polish them in practice!