Evaluation of different imputation techniques

1. Evaluation of different imputation techniques

Welcome to the last lesson of the course! In data science, we usually impute missing data to improve model performance and reduce bias.

2. Evaluation techniques

A good way to measure the quality of an imputation is to look at how a machine learning model performs when trained on the imputed data. In this lesson, we will fit a simple linear regression on each of the imputations we created earlier. Another way to analyze the performance of different imputations is to observe their density plots and see which one most resembles the shape of the original data.

3. Fit a linear model for statistical summary

To perform linear regression, we will use the statsmodels package, as it produces detailed statistical summaries. We will use the 'diabetes' DataFrame to compare imputations. First, we create the complete case 'diabetes_cc' by dropping the rows with missing values; this will be the baseline model to compare the imputations against. We set 'X = sm.add_constant(diabetes_cc.iloc[:, :-1])' to add a constant, in other words an intercept, to the input, which excludes the target column 'Class'. Next, we define the target variable y as diabetes_cc['Class'] and pass y and X to 'sm.OLS()'. Finally, we apply the '.fit()' method to train the linear model.
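
A minimal sketch of these steps, assuming a 'diabetes' DataFrame whose last column is the target 'Class':

```python
import statsmodels.api as sm

# Complete case: drop every row that contains a missing value
diabetes_cc = diabetes.dropna(how='any')

# Features: all columns except the target, plus an added intercept column
X = sm.add_constant(diabetes_cc.iloc[:, :-1])

# Target variable
y = diabetes_cc['Class']

# Fit an ordinary least squares linear model
lm = sm.OLS(y, X).fit()
```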

4. Statistical summary

A detailed summary of the trained model can be obtained using 'lm.summary()'. The adjusted R-squared and the coefficients in particular can be used to evaluate the model performance.
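
Continuing the sketch above:

```python
# Print the full statistical summary of the trained model
print(lm.summary())
```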

5. R-squared and Coefficients

While the R-squared measures how much of the variance in the target the model explains, the coefficients are the weights assigned to the different features in the data. The higher the R-squared, the better the model. We can get the adjusted R-squared and the coefficients of the model with the attributes 'lm.rsquared_adj' and 'lm.params' respectively.
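
For instance:

```python
# Adjusted R-squared: goodness of fit, penalized for the number of features
print(lm.rsquared_adj)

# Coefficients: the intercept plus one weight per feature
print(lm.params)
```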

6. Fit linear model on different imputed DataFrames

Similarly, we repeat these steps for the other imputed DataFrames, namely diabetes_mean_imputed, diabetes_knn_imputed and diabetes_mice_imputed, and compare their R-squared values and coefficients. We add a constant as before to define the input X for each imputation. The target y is the same for all the imputations, since imputation fills in missing values rather than dropping rows, and is therefore defined only once. We fit the linear models lm_mean, lm_KNN and lm_MICE on diabetes_mean_imputed, diabetes_knn_imputed and diabetes_mice_imputed respectively.
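
A sketch of this step, assuming the imputed DataFrames share the column layout of 'diabetes':

```python
# The target column has no missing values, so y is identical
# across the imputed DataFrames and is defined only once
y = diabetes_mean_imputed['Class']

# Fit one linear model per imputed DataFrame
lm_mean = sm.OLS(y, sm.add_constant(diabetes_mean_imputed.iloc[:, :-1])).fit()
lm_KNN = sm.OLS(y, sm.add_constant(diabetes_knn_imputed.iloc[:, :-1])).fit()
lm_MICE = sm.OLS(y, sm.add_constant(diabetes_mice_imputed.iloc[:, :-1])).fit()
```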

7. Comparing R-squared of different imputations

We can compare all the R-squared values by creating a temporary DataFrame that cleanly prints the adjusted R-squared of each imputed DataFrame as well as the complete case. We define one column per model, that is, Complete, Mean Imp, KNN Imp and MICE Imp. You will observe that mean imputation has the lowest R-squared, as it imputes the same mean value throughout a column. The complete case has the highest R-squared, since roughly half the rows, those with missing values, were dropped before fitting the linear model.
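
A sketch of such a comparison table, using the column names from the lesson (the variable name r_squared is illustrative):

```python
import pandas as pd

# Temporary DataFrame holding the adjusted R-squared of each model
r_squared = pd.DataFrame({'Complete': lm.rsquared_adj,
                          'Mean Imp': lm_mean.rsquared_adj,
                          'KNN Imp': lm_KNN.rsquared_adj,
                          'MICE Imp': lm_MICE.rsquared_adj},
                         index=['Adj. R-squared'])
print(r_squared)
```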

8. Comparing coefficients of different imputations

We can similarly compare the coefficients of each imputation using the '.params' attribute. For the columns Glucose, Diastolic_BP, Skin_Fold, Serum_Insulin and BMI, the comparison shows that the imputed values add weight to these features.
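
A sketch that places the coefficients of all four models side by side (the variable name coefficients is illustrative):

```python
# One column of coefficients per model
coefficients = pd.DataFrame({'Complete': lm.params,
                             'Mean Imp': lm_mean.params,
                             'KNN Imp': lm_KNN.params,
                             'MICE Imp': lm_MICE.params})
print(coefficients)
```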

9. Comparing density plots

We can compare the density plots of the imputations to check which one most resembles the original data and does not introduce bias. To do this, we call the '.plot()' method on the 'Skin_Fold' column of each DataFrame and set 'kind' to 'kde'.
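
A minimal sketch, assuming matplotlib is available; the legend labels are illustrative:

```python
import matplotlib.pyplot as plt

# Kernel density estimate of 'Skin_Fold' for the baseline and each imputation
ax = diabetes_cc['Skin_Fold'].plot(kind='kde')
diabetes_mean_imputed['Skin_Fold'].plot(kind='kde', ax=ax)
diabetes_knn_imputed['Skin_Fold'].plot(kind='kde', ax=ax)
diabetes_mice_imputed['Skin_Fold'].plot(kind='kde', ax=ax)

ax.legend(['Baseline (Complete Case)', 'Mean Imputation',
           'KNN Imputation', 'MICE Imputation'])
ax.set_xlabel('Skin_Fold')
plt.show()
```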

10. Comparing density plots

You will observe that the mean imputation is clearly distorted compared to the other imputations. The KNN and MICE imputations are much closer to the base DataFrame, with the peak of the MICE imputation being slightly shifted!

11. Summary

In this chapter, you learned to apply a linear model from the statsmodels package and compare the prediction results of the imputed DataFrames. Lastly, you learned to compare their density plots graphically.

12. Let's practice!

Now let's practice!