1. Preprocessing data
Welcome to the final chapter of the course!
2. scikit-learn requirements
Recall that scikit-learn requires numeric data, with no missing values. All the data that we have used so far has been in this format.
However, with real-world data, this will rarely be the case, and instead we need to preprocess our data before we can build models.
3. Dealing with categorical features
Say we have a dataset containing categorical features, such as color.
As these are not numeric, scikit-learn will not accept them and we need to convert them into numeric features.
We achieve this by splitting the feature into multiple binary features called dummy variables, one for each category.
A zero means the observation does not belong to that category, while a one means it does.
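As a minimal sketch of the idea, here is what this looks like in pandas for a color feature; the data values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data: a single categorical "color" feature
df = pd.DataFrame({"color": ["red", "blue", "green", "blue"]})

# Each category becomes its own binary column
# (dtype=int forces 0/1 rather than True/False in recent pandas)
print(pd.get_dummies(df["color"], dtype=int))
#    blue  green  red
# 0     0      0    1
# 1     1      0    0
# 2     0      1    0
# 3     1      0    0
```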
4. Dummy variables
Say we are working with a music dataset that has a genre feature with ten values such as Electronic, Hip-Hop, and Rock.
5. Dummy variables
We create binary features for each genre.
As each song has one genre, each row will have a 1 in one of the ten columns and zeros in the rest.
If a song is not any of the first nine genres, then implicitly, it is a rock song. That means we only need nine features, so we can
6. Dummy variables
delete the Rock column. If we do not do this, we are duplicating information, which might be an issue for some models.
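To make this concrete, here is a small sketch using made-up genre values. Note that pandas' drop_first argument removes the first category alphabetically rather than Rock specifically, but the encoded information is equivalent.

```python
import pandas as pd

# Made-up genre values for illustration
df = pd.DataFrame({"genre": ["Rock", "Jazz", "Rap"]})

# drop_first=True drops the first category alphabetically (Jazz here);
# a row of all zeros then implicitly encodes the dropped category
print(pd.get_dummies(df["genre"], drop_first=True, dtype=int))
#    Rap  Rock
# 0    0     1
# 1    0     0
# 2    1     0
```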
7. Dealing with categorical features in Python
To create dummy variables we can use scikit-learn's OneHotEncoder, or pandas' get_dummies.
We will use get_dummies.
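For reference, here is a sketch of the OneHotEncoder alternative; the genre values are made up, and the sparse_output argument assumes scikit-learn 1.2 or later (older versions use sparse instead).

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Made-up genre values; OneHotEncoder expects a 2D array
genres = np.array([["Jazz"], ["Rock"], ["Rap"]])

# drop="first" mirrors get_dummies' drop_first=True;
# sparse_output=False returns a dense array (scikit-learn >= 1.2)
encoder = OneHotEncoder(drop="first", sparse_output=False)
print(encoder.fit_transform(genres))
# [[0. 0.]   Jazz (the dropped category)
#  [0. 1.]   Rock
#  [1. 0.]]  Rap
```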
8. Music dataset
We will be working with a music dataset in this chapter, for both classification and regression problems.
Initially, we will build a regression model using all features in the dataset to predict song popularity. There is one categorical feature, genre, with ten possible values.
9. EDA w/ categorical feature
This box plot shows how popularity varies by genre. Let's encode this feature using dummy variables.
10. Encoding dummy variables
We import pandas, read in the DataFrame, and call pd-dot-get_dummies, passing the categorical column. As we only need to keep nine out of our ten binary features, we can set the drop_first argument to True.
Printing the first five rows, we see pandas creates nine new binary features. The first song is Jazz, and the second is Rap, indicated by a 1 in the respective columns.
To bring these binary features back into our original DataFrame we can use pd-dot-concat, passing a list containing the music DataFrame and our dummies DataFrame, and setting axis equal to one.
Lastly, we can remove the original genre column using df-dot-drop, passing the column, and setting axis equal to one.
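Putting those steps together, here is a sketch of the workflow just described; the file name music.csv is assumed for illustration.

```python
import pandas as pd

# Assumed file name for the music dataset
music_df = pd.read_csv("music.csv")

# Create nine binary columns from the ten genre values
music_dummies = pd.get_dummies(music_df["genre"], drop_first=True)
print(music_dummies.head())

# Bring the binary features back into the original DataFrame
music_dummies = pd.concat([music_df, music_dummies], axis=1)

# Remove the original categorical column
music_dummies = music_dummies.drop("genre", axis=1)
```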
11. Encoding dummy variables
If the DataFrame has only one categorical feature, we can pass the entire DataFrame, skipping the step of combining the new columns with the original DataFrame.
If we don't specify a column, the new DataFrame's binary columns will have the original feature name prefixed, so they will start with genre-underscore - as shown here. Notice the original genre column is automatically dropped.
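As a sketch, this shortcut looks like the following, reusing the music_df DataFrame from before.

```python
# Passing the whole DataFrame prefixes new columns with the feature name,
# e.g. genre_Electronic, and drops the original genre column automatically
music_dummies = pd.get_dummies(music_df, drop_first=True)
print(music_dummies.columns)
```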
Once we have dummy variables, we can fit models as before.
12. Linear regression with dummy variables
Using the music_dummies DataFrame, the process for creating training and test sets remains unchanged.
To perform cross-validation, we then create a KFold object, instantiate a linear regression model, and call cross_val_score. We set scoring equal to neg_mean_squared_error, which returns the negative MSE. This is because scikit-learn's cross-validation metrics presume a higher score is better, so error metrics such as MSE are negated.
We can calculate the RMSE by negating the scores to make them positive and then taking the square root, achieved by calling numpy-dot-sqrt and passing our scores with a minus sign in front.
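Here is a sketch of that pipeline, assuming popularity is the target column and reusing music_dummies from above.

```python
import numpy as np
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.linear_model import LinearRegression

# Assumes "popularity" is the target column in music_dummies
X = music_dummies.drop("popularity", axis=1).values
y = music_dummies["popularity"].values

# Creating training and test sets works exactly as before
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
linreg = LinearRegression()

# scoring="neg_mean_squared_error" returns the negative MSE per fold
linreg_cv = cross_val_score(linreg, X_train, y_train, cv=kf,
                            scoring="neg_mean_squared_error")

# Negate to recover positive MSE, then take the square root for RMSE
print(np.sqrt(-linreg_cv))
```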
13. Let's practice!
Now let's practice working with categorical features.