Mean, median & mode imputations
1. Mean, median & mode imputations
In the previous chapter, you learned to analyze and delete missing data. Now it's time to start treating missing values.
2. Basic imputation techniques
The simplest way to impute missing values is to use either a constant like 0 or a simple statistic like the mean, median or mode. Here, mode means the most frequent value of a variable.
3. Mean Imputation
To perform the imputations, let's start by importing 'SimpleImputer' from 'sklearn.impute'. We first create a copy of the original dataset 'diabetes', called 'diabetes_mean', so that we can compare the two later. We set the 'strategy' to 'mean' for mean imputation and create the 'mean_imputer' object.
4. Mean Imputation
We then pass the DataFrame 'diabetes_mean' to the method 'fit_transform()', which fits the imputer to the data and imputes the missing values. 'fit_transform()' returns a NumPy array, so we assign the result back to all rows and columns of 'diabetes_mean' using 'iloc', which keeps the DataFrame's labels.
5. Median imputation
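The mean imputation steps described in slides 3 and 4 can be sketched as follows. The lesson's 'diabetes' dataset is not reproduced in this transcript, so the small DataFrame here is a made-up stand-in; only the column names 'Glucose' and 'Serum_Insulin' come from the lesson.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical stand-in for the lesson's `diabetes` DataFrame (values are made up).
diabetes = pd.DataFrame({
    "Glucose": [148.0, 85.0, np.nan, 89.0, 137.0],
    "Serum_Insulin": [np.nan, 94.0, 168.0, np.nan, 88.0],
})

# Work on a copy so the original stays available for comparison.
diabetes_mean = diabetes.copy(deep=True)

# Each missing entry will be replaced by the mean of its own column.
mean_imputer = SimpleImputer(strategy="mean")

# fit_transform() returns a NumPy array; assigning through iloc keeps
# the DataFrame's column names and index.
diabetes_mean.iloc[:, :] = mean_imputer.fit_transform(diabetes_mean)

print(diabetes_mean)
```

The original 'diabetes' DataFrame still contains its missing values afterwards, which is what makes the later comparison possible.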
For median imputation, you simply change the 'strategy' to 'median'. Here we've created a separate set of variables and applied the 'fit_transform()' method directly.
6. Mode imputation
For mode imputation, we can simply set the 'strategy' argument of 'SimpleImputer()' to 'most_frequent' and impute the DataFrame 'diabetes_mode'.
7. Imputing a constant
Lastly, to impute a constant, set the 'strategy' to 'constant' and the argument 'fill_value' to the constant you want, such as 0.
8. Scatterplot of imputation
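The median, mode and constant imputations from slides 5 to 7 differ from mean imputation only in the arguments passed to 'SimpleImputer'. A minimal sketch, assuming a made-up stand-in for 'diabetes' and a hypothetical helper 'impute_copy' that is not part of the lesson; 'diabetes_median' and 'diabetes_constant' are assumed names following the lesson's 'diabetes_mean' and 'diabetes_mode':

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical stand-in for the lesson's `diabetes` DataFrame (values are made up).
diabetes = pd.DataFrame({
    "Glucose": [148.0, 85.0, np.nan, 89.0, 85.0],
    "Serum_Insulin": [np.nan, 94.0, 168.0, np.nan, 94.0],
})

def impute_copy(df, **imputer_kwargs):
    """Return an imputed copy of df (hypothetical helper, not from the lesson)."""
    out = df.copy(deep=True)
    out.iloc[:, :] = SimpleImputer(**imputer_kwargs).fit_transform(out)
    return out

# Same workflow each time; only the strategy (and fill_value) changes.
diabetes_median = impute_copy(diabetes, strategy="median")
diabetes_mode = impute_copy(diabetes, strategy="most_frequent")
diabetes_constant = impute_copy(diabetes, strategy="constant", fill_value=0)

print(diabetes_median, diabetes_mode, diabetes_constant, sep="\n\n")
```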
Using the last chapter's scatterplot technique, we plot the mean-imputed columns `Serum_Insulin` and `Glucose` of 'diabetes_mean' using the 'plot()' method. To distinguish imputed from observed values, we set the color 'c' to the sum of the nullity of both columns and map it to the 'rainbow' colormap. We can also set the title to "Mean Imputation". However, to compare different imputations, we need to draw all of them as subplots of a single figure.
9. Visualizing imputations
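The single scatterplot from slide 8 might look like the sketch below. The data is again a made-up stand-in for 'diabetes'. The nullity is computed with a logical OR and cast to integers, which is an assumption about how the lesson combines the two masks, and 'alpha=0.5' is an illustrative choice not stated in the transcript.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display; omit in a notebook
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical stand-in for the lesson's `diabetes` DataFrame (values are made up).
diabetes = pd.DataFrame({
    "Glucose": [148.0, 85.0, np.nan, 89.0, 137.0],
    "Serum_Insulin": [np.nan, 94.0, 168.0, np.nan, 88.0],
})

diabetes_mean = diabetes.copy(deep=True)
diabetes_mean.iloc[:, :] = SimpleImputer(strategy="mean").fit_transform(diabetes_mean)

# 1 where either column was missing in the ORIGINAL data, 0 otherwise.
nullity = (diabetes["Serum_Insulin"].isnull() | diabetes["Glucose"].isnull()).astype(int)

# Color each point by its nullity so imputed rows stand out.
ax = diabetes_mean.plot(
    x="Serum_Insulin", y="Glucose", kind="scatter", alpha=0.5,
    c=nullity, cmap="rainbow", title="Mean Imputation",
)
```

Note that the nullity is taken from the original 'diabetes', not from 'diabetes_mean', because the imputed copy no longer has any missing values to mark.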
First, we create subplots using 'plt.subplots()', setting both 'nrows' and 'ncols' to 2 so that we can plot the four imputed DataFrames. We also set the figure size to (10, 10). Since the nullity is the same for all the imputations, it needs no changes here. We store the four DataFrames in a dictionary and loop over them to generate the graphs. The dictionary keys are the names of the imputation methods, such as 'Mean Imputation' for the mean-imputed DataFrame and 'Median Imputation' for the median-imputed DataFrame, so we can reuse them as the 'title'. We zip 'axes' and 'imputations' and loop over the pairs. Each DataFrame is accessed through 'df_key', the dictionary key. The 'flatten()' method flattens the axes array from shape (2, 2) to a one-dimensional array of four axes. While plotting, in addition to the previous arguments, we also set 'colorbar=False' so that the color bar is not drawn, and we set the 'title' to 'df_key'.
10. Insert title here...
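The subplot loop described in slide 9 can be sketched as below. The data and the helper 'impute_copy' are hypothetical, and the last two dictionary keys are assumed names, since the transcript only spells out 'Mean Imputation' and 'Median Imputation'.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical stand-in for the lesson's `diabetes` DataFrame (values are made up).
diabetes = pd.DataFrame({
    "Glucose": [148.0, 85.0, np.nan, 89.0, 85.0],
    "Serum_Insulin": [np.nan, 94.0, 168.0, np.nan, 94.0],
})

def impute_copy(df, **imputer_kwargs):
    """Return an imputed copy of df (hypothetical helper, not from the lesson)."""
    out = df.copy(deep=True)
    out.iloc[:, :] = SimpleImputer(**imputer_kwargs).fit_transform(out)
    return out

# Keys double as subplot titles; the last two names are assumptions.
imputations = {
    "Mean Imputation": impute_copy(diabetes, strategy="mean"),
    "Median Imputation": impute_copy(diabetes, strategy="median"),
    "Most Frequent Imputation": impute_copy(diabetes, strategy="most_frequent"),
    "Constant Imputation": impute_copy(diabetes, strategy="constant", fill_value=0),
}

# The nullity comes from the original data, so it is shared by all four plots.
nullity = (diabetes["Serum_Insulin"].isnull() | diabetes["Glucose"].isnull()).astype(int)

fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(10, 10))

# axes has shape (2, 2); flatten() gives a 1-D array of 4 axes to pair with the keys.
for ax, df_key in zip(axes.flatten(), imputations):
    imputations[df_key].plot(
        x="Serum_Insulin", y="Glucose", kind="scatter", alpha=0.5,
        c=nullity, cmap="rainbow", ax=ax, colorbar=False, title=df_key,
    )
```

Iterating over a dictionary yields its keys in insertion order, so each subplot receives the DataFrame and title that belong together.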
Observing the graph, there's a clear correlation between 'Serum_Insulin' and 'Glucose'. However, the imputed values, shown in red, lie on a straight line because every imputed value is the same regardless of the other variable. We can therefore conclude that mean, median and mode imputations preserve only these basic statistics of the dataset and don't account for correlations between variables. This, in turn, introduces bias into the dataset. In the next lessons, we will learn about more robust imputation techniques.
11. Summary
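The straight line of imputed points noted in slide 10 can also be checked numerically. The sketch below uses made-up data and shows two things: every imputed 'Serum_Insulin' value is identical, and the column's standard deviation shrinks after mean imputation. The shrinking spread is offered as one illustration of the bias mentioned in the lesson, not as the lesson's own analysis.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical stand-in for the lesson's `diabetes` DataFrame (values are made up).
diabetes = pd.DataFrame({
    "Glucose": [148.0, 85.0, 183.0, 89.0, 137.0, 116.0],
    "Serum_Insulin": [np.nan, 94.0, 168.0, np.nan, 88.0, np.nan],
})
was_missing = diabetes["Serum_Insulin"].isnull()

diabetes_mean = diabetes.copy(deep=True)
diabetes_mean.iloc[:, :] = SimpleImputer(strategy="mean").fit_transform(diabetes_mean)

# Every filled-in row receives the same value, whatever its Glucose level.
imputed = diabetes_mean.loc[was_missing, "Serum_Insulin"]
n_distinct_imputed = imputed.nunique()

# Spread before (pandas skips NaN by default) and after imputation.
std_before = diabetes["Serum_Insulin"].std()
std_after = diabetes_mean["Serum_Insulin"].std()

print(n_distinct_imputed, std_before, std_after)
```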
To summarize, in this lesson you learned to impute missing values with basic statistics like the mean, median and mode. You learned to compare multiple imputations graphically. Lastly, you learned to assess the quality of these imputations.
12. Let's practice!
It's now time to solidify these concepts!