Kidney disease case study I: Categorical Imputer
You'll now continue your exploration of using pipelines with a dataset that requires significantly more wrangling. The chronic kidney disease dataset contains both categorical and numeric features, but many of its values are missing. The goal here is to predict who has chronic kidney disease given various blood indicators as features.
As Sergey mentioned in the video, you'll be introduced to a new library, sklearn_pandas, that allows you to chain many more processing steps inside of a pipeline than are currently supported in scikit-learn. Specifically, you'll be able to use the DataFrameMapper() class to apply any arbitrary sklearn-compatible transformer on DataFrame columns, where the resulting output can be either a NumPy array or a DataFrame.
We've also created a transformer called a Dictifier that encapsulates converting a DataFrame using .to_dict("records") without you having to do it explicitly (and so that it works in a pipeline). Finally, we've also provided the list of feature names in kidney_feature_names, the target name in kidney_target_name, the features in X, and the target in y.
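Dictifier is provided pre-built in the exercise environment, but a minimal sketch of what such a transformer might look like is shown below, assuming it does nothing more than wrap .to_dict("records") so that the conversion can sit inside a pipeline (this re-implementation is illustrative, not the course's exact code):
# Illustrative re-implementation of a Dictifier-style transformer (not the course's exact code)
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class Dictifier(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        # Stateless: nothing to learn from the data
        return self

    def transform(self, X):
        # Convert each row to a {column: value} dict, the list-of-dicts format DictVectorizer accepts
        if isinstance(X, pd.DataFrame):
            return X.to_dict("records")
        return pd.DataFrame(X).to_dict("records")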
In this exercise, your task is to apply sklearn's SimpleImputer to impute all of the categorical columns in the dataset. You can refer to how the numeric imputation mapper was created as a template. Notice the keyword arguments input_df=True and df_out=True? These are there so that you can work with DataFrames instead of arrays. By default, the transformers are passed a numpy array of the selected columns as input, and as a result, the output of the DataFrameMapper is also an array. Scikit-learn transformers have historically been designed to work with numpy arrays, not pandas DataFrames, even though their basic indexing interfaces are similar.
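As a quick illustration of those two keyword arguments, the toy example below (the data is made up, not the kidney dataset) shows that with the defaults the mapper returns a NumPy array, while input_df=True and df_out=True keep pandas DataFrames flowing in and out:
# Toy illustration of input_df / df_out (hypothetical data, not the kidney dataset)
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn_pandas import DataFrameMapper

toy = pd.DataFrame({"bp": [80.0, np.nan, 70.0]})

# Defaults: the transformer receives a NumPy array and the mapper returns an array
array_mapper = DataFrameMapper([(["bp"], SimpleImputer(strategy="median"))])
print(type(array_mapper.fit_transform(toy)))  # <class 'numpy.ndarray'>

# With input_df=True and df_out=True, a DataFrame goes in and a DataFrame comes out
df_mapper = DataFrameMapper(
    [(["bp"], SimpleImputer(strategy="median"))],
    input_df=True,
    df_out=True
)
print(type(df_mapper.fit_transform(toy)))  # <class 'pandas.core.frame.DataFrame'>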
Exercise instructions
- Apply the categorical imputer using DataFrameMapper() and SimpleImputer(). Pass strategy="most_frequent" to SimpleImputer() so it can impute the string-valued categorical columns (the default "mean" strategy only works on numeric data). The columns are contained in categorical_columns. Be sure to specify input_df=True and df_out=True, and use category_feature as your iterator variable in the list comprehension.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Import necessary modules
from sklearn_pandas import DataFrameMapper
from sklearn.impute import SimpleImputer
# Check number of nulls in each feature column
nulls_per_column = X.isnull().sum()
print(nulls_per_column)
# Create a boolean mask for categorical columns
categorical_feature_mask = X.dtypes == object
# Get list of categorical column names
categorical_columns = X.columns[categorical_feature_mask].tolist()
# Get list of non-categorical column names
non_categorical_columns = X.columns[~categorical_feature_mask].tolist()
# Apply numeric imputer
numeric_imputation_mapper = DataFrameMapper(
    [([numeric_feature], SimpleImputer(strategy="median")) for numeric_feature in non_categorical_columns],
    input_df=True,
    df_out=True
)
# Apply categorical imputer
categorical_imputation_mapper = ____(
    [([category_feature], ____) for ____ in ____],
    input_df=____,
    df_out=____
)
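For reference, here is one possible completion of the categorical mapper, mirroring the numeric one above. This is a sketch rather than the official solution: it assumes the categorical columns hold strings, so the imputer uses the "most_frequent" strategy, and each column name is wrapped in a list so that SimpleImputer receives the 2-D input it expects.
# One possible completion (sketch) that mirrors the numeric mapper above
categorical_imputation_mapper = DataFrameMapper(
    [([category_feature], SimpleImputer(strategy="most_frequent")) for category_feature in categorical_columns],
    input_df=True,
    df_out=True
)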