Imputing categorical values

1. Imputing categorical values

In the previous lessons, you learned to impute missing values in numerical and time-series data. In this lesson, you will learn to impute categorical data.

2. Complexity with categorical values

The complexity with categorical data is that the values are usually strings, so numerical imputation techniques cannot be applied to them directly. The categorical values must first be encoded as numeric values and then imputed.

3. Conversion techniques

To convert the categories to numeric values, we need to encode them. We can opt for either an ordinal encoder or a one-hot encoder. From the table, you can observe that one-hot encoding creates a column for each category, with a value of 1 only in the column of the respective category and 0 in the rest. Ordinal encoding, on the other hand, assigns a single numeric value to each category. While the one-hot encoder is the more apt choice, we will use the ordinal encoder for simplicity.
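To make the difference between the two encodings concrete, here is a minimal sketch on a toy column. The 'colors' DataFrame and its values are made up for illustration and are not part of the lesson's data; pd.get_dummies is used here as one convenient way to one-hot encode.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data, not from the lesson
colors = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One-hot encoding: one column per category, 1 only for the row's own category
one_hot = pd.get_dummies(colors["color"], dtype=int)

# Ordinal encoding: a single column holding one numeric code per category
ordinal = OrdinalEncoder().fit_transform(colors[["color"]])

print(one_hot)
print(ordinal.ravel())
```

Note that the ordinal codes follow the sorted order of the category names, so they imply an ordering that the categories may not actually have.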

4. Imputation techniques

The simplest way to impute is to fill in the most frequent value, which we covered in the second chapter. However, we want a more robust approach, so we will use the KNN imputer from fancyimpute to impute the missing values. Let's perform the two operations, encoding and imputing, on a dataset.
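As a quick reminder of the simple baseline, filling in the most frequent value might look like the sketch below. The toy Series is made up, and using pandas' mode() with fillna() is an assumption about how one could do it, not necessarily the approach from the earlier chapter.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing value
ambience = pd.Series(["family", "friends", np.nan, "family"])

# mode() ignores NaN by default; take the first mode if there are ties
most_frequent = ambience.mode()[0]
ambience_filled = ambience.fillna(most_frequent)

print(ambience_filled.tolist())
```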

5. Users profile data

We will use a new dataset, 'users', which contains customer interests and preferences recorded by a restaurant. This dataset contains only categorical values.

6. Ordinal Encoding

In our first step, we will convert the categorical values to numerical values. We will import OrdinalEncoder from sklearn.preprocessing. The downside of this encoder is that it cannot handle NaNs, so we need to skip the NaNs and convert only the non-missing categorical values. To do this, we first select a column, say 'ambience', from the DataFrame 'users' and keep its non-null values. Next, we reshape these values to (-1, 1), that is, a single column, so that they can be fit and transformed by the ordinal encoder 'ambience_ord_enc'. Lastly, we replace the categorical values in the column 'ambience' of the 'users' DataFrame with the ordinal values, leaving the NaNs in place.
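The steps for a single column can be sketched as follows. The small 'users' DataFrame here is a made-up stand-in for the real dataset, and the way the encoded values are written back (through an index-aligned Series) is one possible choice rather than the lesson's exact code.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical stand-in for the 'users' dataset
users = pd.DataFrame(
    {"ambience": ["family", "friends", np.nan, "solitary", "family"]}
)

ambience_ord_enc = OrdinalEncoder()

# Select the column and keep only the non-null values
ambience = users["ambience"]
ambience_not_null = ambience[ambience.notnull()]

# Reshape to a single column, shape (-1, 1), as the encoder expects 2-D input
reshaped_vals = ambience_not_null.to_numpy(dtype=object).reshape(-1, 1)
encoded_vals = ambience_ord_enc.fit_transform(reshaped_vals)

# Write the codes back by index; rows that were NaN stay NaN
encoded = pd.Series(encoded_vals.ravel(), index=ambience_not_null.index)
users["ambience"] = encoded.reindex(users.index)

print(users)
```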

7. Ordinal Encoding

We can generalize this conversion by looping over the columns. Observe that we also create a separate encoder for each column and store it in the dictionary 'ordinal_enc_dict'. This will help us later convert the values back to their respective categories.
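A minimal sketch of the generalized loop is shown below. The two-column 'users' DataFrame is made up for illustration; only the names 'users' and 'ordinal_enc_dict' come from the lesson.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical stand-in for the 'users' dataset
users = pd.DataFrame(
    {
        "ambience": ["family", "friends", np.nan, "solitary", "family"],
        "smoker": ["no", "yes", "no", np.nan, "no"],
    }
)

# One encoder per column, kept so the codes can be mapped back later
ordinal_enc_dict = {}

for col_name in users.columns:
    ordinal_enc_dict[col_name] = OrdinalEncoder()

    # Encode only the non-null values of this column
    col = users[col_name]
    col_not_null = col[col.notnull()]
    reshaped_vals = col_not_null.to_numpy(dtype=object).reshape(-1, 1)
    encoded_vals = ordinal_enc_dict[col_name].fit_transform(reshaped_vals)

    # Write the codes back by index; missing rows stay NaN
    encoded = pd.Series(encoded_vals.ravel(), index=col_not_null.index)
    users[col_name] = encoded.reindex(users.index)

print(users)
```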

8. Imputing with KNN

Our next step is to impute the DataFrame 'users'. For this, we create a copy, 'users_KNN_imputed', so that it can be used for comparison. We will use the KNN imputer from fancyimpute, as before, to fit and impute the data. Additionally, we round the imputed values with 'np.round', because they must be whole numbers that correspond to category codes, not decimals. Our last step is to convert the ordinal values back to their labels using the method 'inverse_transform', applying each column's own encoder to that column.
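The impute, round, and convert-back sequence can be sketched end to end as below. Note the substitution: this sketch uses scikit-learn's KNNImputer in place of fancyimpute's KNN so that it depends only on scikit-learn. The toy data, the 'raw' variable, and the choice of n_neighbors=2 are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical stand-in for the original categorical data
raw = pd.DataFrame(
    {
        "ambience": ["family", "friends", np.nan, "solitary", "family", "friends"],
        "smoker": ["no", "yes", "no", np.nan, "no", "yes"],
        "budget": ["low", "medium", "low", "high", np.nan, "medium"],
    }
)

# Step 1: ordinal-encode the non-null values, one encoder per column
users = raw.copy()
ordinal_enc_dict = {}
for col_name in raw.columns:
    enc = OrdinalEncoder()
    not_null = raw[col_name].dropna()
    codes = enc.fit_transform(not_null.to_numpy(dtype=object).reshape(-1, 1))
    users[col_name] = pd.Series(codes.ravel(), index=not_null.index).reindex(raw.index)
    ordinal_enc_dict[col_name] = enc

# Step 2: impute on a copy and round to whole-number category codes
users_KNN_imputed = users.copy(deep=True)
knn_imputer = KNNImputer(n_neighbors=2)  # n_neighbors chosen arbitrarily here
users_KNN_imputed[users.columns] = np.round(knn_imputer.fit_transform(users))

# Step 3: map the codes back to labels with each column's own encoder
for col_name in users_KNN_imputed.columns:
    reshaped = users_KNN_imputed[col_name].to_numpy().reshape(-1, 1)
    labels = ordinal_enc_dict[col_name].inverse_transform(reshaped)
    users_KNN_imputed[col_name] = labels.ravel()

print(users_KNN_imputed)
```

Because ordinal codes impose an arbitrary order on the categories, the distances that KNN computes on them are only a rough notion of similarity, which is one reason one-hot encoding was described earlier as the more apt choice.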

9. Summary

To summarize, the three steps in imputing missing categorical values are: converting the non-missing categorical values to ordinal values, imputing the ordinal DataFrame, and lastly converting the values back to categories.

10. Let's practice!

Now that you have followed the three steps to impute missing categorical values, it's time for you to practice!
