Imputing using fancyimpute

1. Imputing using fancyimpute

Welcome back to the course. In the previous chapter, we discussed imputation techniques on a variety of data types. In this chapter, we will continue by learning advanced imputation techniques with the 'fancyimpute' package.

2. fancyimpute package

'fancyimpute' is a package containing several advanced imputation techniques that use machine learning algorithms to impute missing values. In the previous lessons, we used techniques such as mean, median, and mode imputation, and interpolation. Those techniques use only the column being imputed to compute the fill values. In contrast, the advanced imputation techniques also use the other columns to predict the missing values and impute them. Think of it as fitting a machine learning model that predicts the missing values in one column from the remaining columns.
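To make the contrast concrete, here is a minimal sketch in plain Python, not taken from the course. It fills one missing value twice: once with the column mean, and once with a straight-line model fitted on another column. The 'height_cm' and 'weight_kg' data and the 'fit_line' helper are hypothetical and exist only for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical toy data: the last weight is missing.
height_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
weight_kg = [50.0, 60.0, 70.0, 80.0, None]

# Rows where the target column is observed.
observed = [(h, w) for h, w in zip(height_cm, weight_kg) if w is not None]

# Single-column approach: fill with the mean of the observed weights.
column_mean = sum(w for _, w in observed) / len(observed)
mean_imputed = [column_mean if w is None else w for w in weight_kg]

# Model-based approach: predict weight from height.
slope, intercept = fit_line([h for h, _ in observed], [w for _, w in observed])
model_imputed = [
    slope * h + intercept if w is None else w
    for h, w in zip(height_cm, weight_kg)
]

print(mean_imputed[-1], model_imputed[-1])
```

The mean fill ignores that the last person is the tallest in the data, while the model-based fill uses that information.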

3. Fancyimpute imputation techniques

In this lesson, we will learn two very important techniques, namely, KNN or K Nearest Neighbor imputation and MICE or Multiple Imputation by Chained Equations imputation.

4. K-Nearest Neighbor Imputation

The KNN imputation technique uses the K-Nearest Neighbor algorithm to predict the missing values. For a data point with a missing feature, the algorithm finds the most similar data points, measuring similarity on the features that are not missing, and fills the gap with the average of that feature across those similar points. Here, K specifies the number of similar, or nearest, points to consider.
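The idea can be sketched in a few lines of plain Python. This is an illustrative implementation under simple assumptions (Euclidean-style distance on shared features, unweighted mean of the neighbors), not the code used inside 'fancyimpute', which may differ in its distance and weighting choices. The 'knn_impute' function and the toy 'data' are hypothetical.

```python
import math


def knn_impute(rows, k=2):
    """Fill None entries with the mean of that feature over the k nearest rows."""

    def distance(a, b):
        # Compare two rows only on the features present in both.
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        # Root mean squared difference, so the number of shared features
        # does not dominate the comparison.
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    filled = [list(row) for row in rows]  # leave the input untouched
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate neighbors: other rows where feature j is observed.
            candidates = [
                (distance(row, other), other[j])
                for n, other in enumerate(rows)
                if n != i and other[j] is not None
            ]
            candidates.sort(key=lambda pair: pair[0])
            nearest = [v for d, v in candidates[:k] if d < math.inf]
            if nearest:
                filled[i][j] = sum(nearest) / len(nearest)
    return filled


# Hypothetical toy data: the last row is missing its third feature.
data = [
    [1.0, 2.0, 10.0],
    [1.1, 2.1, 12.0],
    [9.0, 8.0, 50.0],
    [1.05, 2.05, None],
]
result = knn_impute(data, k=2)
print(result[3])
```

With K set to 2, the last row borrows from the two rows closest to it on the first two features and ignores the distant third row.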

5. K-Nearest Neighbor Imputation

Let's impute the missing values in the diabetes DataFrame. We import the 'KNN' class from fancyimpute and create an imputer object, 'knn_imputer'. As before, we first create a copy of the diabetes DataFrame, 'diabetes_knn', so that we can compare it with the original later on. Next, we call the '.fit_transform()' method of 'knn_imputer' on the copy to impute it.
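The steps just described might look like the following sketch. It assumes that 'fancyimpute' and pandas are installed and that every column of the DataFrame is numeric. The wrapper function 'impute_with_knn' and its 'k' argument are hypothetical conveniences, not part of the lesson.

```python
def impute_with_knn(df, k=5):
    """Return a KNN-imputed copy of a numeric pandas DataFrame (sketch)."""
    # Imported here so the sketch can be read without fancyimpute installed.
    from fancyimpute import KNN

    knn_imputer = KNN(k=k)
    # Work on a deep copy so the original frame stays available for comparison.
    df_knn = df.copy(deep=True)
    # fit_transform returns a NumPy array with the same shape as its input.
    df_knn.iloc[:, :] = knn_imputer.fit_transform(df_knn.to_numpy(dtype=float))
    return df_knn


# Example call, assuming a DataFrame named diabetes is already loaded:
# diabetes_knn = impute_with_knn(diabetes)
```

Writing the result back through '.iloc[:, :]' keeps the column names and index of the copy.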

6. Multiple Imputations by Chained Equations (MICE)

MICE is a robust but more complex technique for imputing missing values. It runs multiple regressions over the data, each predicting the missing values of one column from the other columns, and takes an average value to fill in the missing feature for a data point.
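The chained-regression part of this idea can be illustrated with two columns that each have a gap. The sketch below is a simplified, plain-Python illustration under strong assumptions: a single chain, straight-line regressions, and no averaging across multiple imputed datasets, so it is not a full MICE implementation. The 'chained_impute' and 'fit_line' names and the toy data are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def chained_impute(a, b, n_rounds=20):
    """Alternately regress each column on the other to refine the fills."""
    miss_a = [v is None for v in a]
    miss_b = [v is None for v in b]

    # Start from column-mean fills.
    mean_a = sum(v for v in a if v is not None) / miss_a.count(False)
    mean_b = sum(v for v in b if v is not None) / miss_b.count(False)
    a_fill = [mean_a if m else v for v, m in zip(a, miss_a)]
    b_fill = [mean_b if m else v for v, m in zip(b, miss_b)]

    rows = range(len(a))
    for _ in range(n_rounds):
        # Predict a from b, fitting only on rows where a was observed.
        fit_rows = [i for i in rows if not miss_a[i]]
        slope, intercept = fit_line(
            [b_fill[i] for i in fit_rows], [a_fill[i] for i in fit_rows]
        )
        for i in rows:
            if miss_a[i]:
                a_fill[i] = slope * b_fill[i] + intercept

        # Predict b from the refreshed a, fitting on rows where b was observed.
        fit_rows = [i for i in rows if not miss_b[i]]
        slope, intercept = fit_line(
            [a_fill[i] for i in fit_rows], [b_fill[i] for i in fit_rows]
        )
        for i in rows:
            if miss_b[i]:
                b_fill[i] = slope * a_fill[i] + intercept
    return a_fill, b_fill


# Hypothetical toy data in which b is exactly twice a.
col_a = [1.0, 2.0, 3.0, 4.0, None, 6.0]
col_b = [2.0, 4.0, None, 8.0, 10.0, 12.0]
a_done, b_done = chained_impute(col_a, col_b)
print(a_done[4], b_done[2])
```

Each round uses the latest fills of one column to improve the fills of the other, which is why the equations are described as chained.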

7. Multiple Imputations by Chained Equations (MICE)

In the fancyimpute package, MICE is available as the 'IterativeImputer' class, which performs multiple imputations on the data. We can impute the diabetes DataFrame as we did previously, by creating a copy and then imputing it with the '.fit_transform()' method.
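A sketch of that workflow follows, mirroring the KNN steps. It assumes that 'fancyimpute' and pandas are installed and that all columns are numeric. The wrapper function 'impute_with_mice' is a hypothetical convenience, and the imputer is created with its default settings.

```python
def impute_with_mice(df):
    """Return a MICE-imputed copy of a numeric pandas DataFrame (sketch)."""
    # Imported here so the sketch can be read without fancyimpute installed.
    from fancyimpute import IterativeImputer

    mice_imputer = IterativeImputer()
    # Work on a deep copy so the original frame stays available for comparison.
    df_mice = df.copy(deep=True)
    # fit_transform returns a NumPy array with the same shape as its input.
    df_mice.iloc[:, :] = mice_imputer.fit_transform(df_mice.to_numpy(dtype=float))
    return df_mice


# Example call, assuming a DataFrame named diabetes is already loaded:
# diabetes_mice = impute_with_mice(diabetes)
```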

8. Summary

In this lesson, we learned to impute missing values with two machine learning techniques, KNN and MICE. While KNN finds the most similar points and uses them to impute the missing values, MICE performs multiple regressions on the data to impute. MICE is a very robust model for imputing missing values.

9. Let's practice!

It's now time for you to put these concepts to practice!
