1. Handling missing data
Now let's look at how to handle missing data.
2. Missing data
When there is no value for a feature in a particular row, we call it missing data.
This can happen because there was no observation or the data might be corrupt. Whatever the reason, we need to deal with it.
3. Music dataset
Previously, we worked with a modified music dataset. Now let's inspect missing values in the original version, which contains one thousand rows, by chaining pandas' dot-isna method with dot-sum and dot-sort_values.
Each feature is missing between 8 and 200 values!
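A minimal sketch of that check, assuming the dataset has been loaded into a DataFrame named music_df (the file name and DataFrame name are assumptions):

```python
import pandas as pd

# Load the original music dataset; the file name is an assumption
music_df = pd.read_csv("music.csv")

# Count missing values per column, sorted from fewest to most
print(music_df.isna().sum().sort_values())
```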
4. Dropping missing data
A common approach is to drop rows with missing values in columns where those missing values account for less than 5% of all data.
To do this, we use pandas' dot-dropna method, passing a list of columns with less than 5% missing values to the subset argument.
If a row has a missing value in any of the subset columns, the entire row is removed.
Rechecking the DataFrame, we see fewer missing values.
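Here is a sketch of that step; the column names passed to subset are assumptions standing in for whichever columns have less than 5% missing values:

```python
# Drop rows that have missing values in any of these columns
music_df = music_df.dropna(subset=["genre", "popularity", "loudness",
                                   "liveness", "tempo"])

# Recheck the remaining missing values
print(music_df.isna().sum().sort_values())
```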
5. Imputing values
Another option is to impute missing data. This means making an educated guess as to what the missing values could be.
We can impute the mean of all non-missing entries for a given feature.
We can also use other values like the median.
For categorical values we commonly impute the most frequent value.
Note we must split our data before imputing to avoid leaking test set information to our model, a concept known as data leakage.
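As an illustration of why the split comes first, here is a hand-rolled mean imputation on a hypothetical pre-split pair of DataFrames, train_df and test_df; the column name is also hypothetical. The test set is filled with the training mean, so no test set information reaches the model:

```python
# Compute the mean on the training data only
train_mean = train_df["loudness"].mean()

# Fill missing values in both sets with that training mean
train_df["loudness"] = train_df["loudness"].fillna(train_mean)
test_df["loudness"] = test_df["loudness"].fillna(train_mean)
```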
6. Imputation with scikit-learn
Here is a workflow for imputation to predict song popularity.
We import SimpleImputer from sklearn-dot-impute.
As we will use different imputation methods for categorical and numeric features, we first split them, storing as X_cat and X_num respectively, along with our target array as y.
We create categorical training and test sets.
We repeat this for the numeric features. Using the same value for the random_state argument reproduces the same row split, so the target arrays' values remain unchanged.
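A sketch of this setup; the column names, test size, and random_state value are assumptions:

```python
from sklearn.model_selection import train_test_split

# Separate the categorical feature, numeric features, and target
X_cat = music_df["genre"].values.reshape(-1, 1)
X_num = music_df.drop(["genre", "popularity"], axis=1).values
y = music_df["popularity"].values

# Categorical training and test sets
X_train_cat, X_test_cat, y_train, y_test = train_test_split(
    X_cat, y, test_size=0.2, random_state=12)

# Numeric training and test sets; the same random_state reproduces
# the same row split, so y_train and y_test are unchanged
X_train_num, X_test_num, y_train, y_test = train_test_split(
    X_num, y, test_size=0.2, random_state=12)
```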
To impute missing categorical values we instantiate a SimpleImputer, setting strategy as most frequent. By default, SimpleImputer expects NumPy-dot-NaN to represent missing values.
Now we call dot-fit_transform to impute the training categorical features' missing values!
For the test categorical features, we call dot-transform.
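Sketched out, the categorical imputation looks like this:

```python
from sklearn.impute import SimpleImputer

# Impute the most frequent category; by default, missing values
# are expected to be np.nan
imp_cat = SimpleImputer(strategy="most_frequent")

# Fit on the training features and transform them...
X_train_cat = imp_cat.fit_transform(X_train_cat)

# ...but only transform the test features
X_test_cat = imp_cat.transform(X_test_cat)
```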
7. Imputation with scikit-learn
For our numeric data, we instantiate another imputer. By default, it fills values with the mean.
We fit and transform the training features, and transform the test features.
We then combine our training data using numpy-dot-append, passing our two arrays and setting axis equal to 1.
We repeat this for our test data.
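A sketch of the numeric imputation and the recombination step:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# The default strategy fills missing values with the column mean
imp_num = SimpleImputer()
X_train_num = imp_num.fit_transform(X_train_num)
X_test_num = imp_num.transform(X_test_num)

# Rejoin the numeric and categorical arrays column-wise
X_train = np.append(X_train_num, X_train_cat, axis=1)
X_test = np.append(X_test_num, X_test_cat, axis=1)
```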
Due to their ability to transform our data, imputers are known as transformers.
8. Imputing within a pipeline
We can also impute using a pipeline, which is an object used to run a series of transformations and build a model in a single workflow.
To do this, we import Pipeline from sklearn-dot-pipeline. Here we perform binary classification to predict whether a song is rock or another genre.
We drop rows with missing values in columns where those values account for less than five percent of our data.
We convert values in the genre column, which will be the target, to a 1 if Rock, else 0, using numpy-dot-where.
We then create X and y.
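Putting that preparation together in a sketch, where the subset column names are again assumptions:

```python
import numpy as np

# Drop rows with missing values in the low-missingness columns
music_df = music_df.dropna(subset=["genre", "popularity", "loudness",
                                   "liveness", "tempo"])

# Binary target: 1 if the genre is Rock, 0 otherwise
music_df["genre"] = np.where(music_df["genre"] == "Rock", 1, 0)

# Create the feature array and target array
X = music_df.drop("genre", axis=1).values
y = music_df["genre"].values
```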
9. Imputing within a pipeline
To build a pipeline, we construct a list of steps, where each step is a tuple containing the step name as a string and an instantiated transformer or model.
We pass this list when instantiating a Pipeline.
We then split our data, and fit the pipeline to the training data, as with any other model.
Finally, we compute accuracy.
Note that, in a pipeline, each step but the last must be a transformer.
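Here is a minimal sketch of the full pipeline; the choice of a KNN classifier and the split parameters are assumptions:

```python
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# Each step is a (name, estimator) tuple; every step but the
# last must be a transformer
steps = [("imputation", SimpleImputer()),
         ("knn", KNeighborsClassifier(n_neighbors=3))]
pipeline = Pipeline(steps)

# Split the data and fit the pipeline like any other model
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
pipeline.fit(X_train, y_train)

# Accuracy on the test set
print(pipeline.score(X_test, y_test))
```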
10. Let's practice!
Now let's create a pipeline to handle missing data and build a model!