Encoding categorical variables

1. Encoding categorical variables

Because models in scikit-learn require numerical input, if the dataset contains categorical variables, we'll have to encode them. Let's take a look at how to do that.

2. Categorical variables

Real-world data often contains categorical variables, which can take only a finite number of discrete values. For example, here's some user data with categorical values. We have a subscribed column with binary yes or no values, as well as a column of users' favorite colors, which has more than two categories.

3. Encoding binary variables - pandas

The first encoding we'll cover is encoding binary values, like in the column shown. This is actually quite simple, and can be done in both pandas and scikit-learn. In pandas, we can use the apply method to encode a DataFrame column as 1s and 0s. Using apply, we can write a conditional that returns a 1 if the value in subscribed is y, and a 0 if the value is n. Looking at a side-by-side comparison of the columns, we can see that the column is now numerically encoded. pandas could be a good choice if we haven't finished preprocessing, or if we're interested in further exploratory work once we've encoded.
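As a minimal sketch of the apply approach just described: the sample rows, the DataFrame name users, and the new column name sub_enc are assumptions made for illustration. Only the subscribed and fav_color column names come from the example data.

```python
import pandas as pd

# Hypothetical sample data; only the column names follow the example above
users = pd.DataFrame({
    "subscribed": ["y", "n", "n", "y", "y"],
    "fav_color": ["blue", "green", "orange", "green", "blue"],
})

# Conditional: 1 if the value is "y", otherwise 0
users["sub_enc"] = users["subscribed"].apply(lambda val: 1 if val == "y" else 0)

# Side-by-side comparison of the original and encoded columns
print(users[["subscribed", "sub_enc"]])
```

Note that this conditional maps anything other than y to 0, so missing or unexpected values would need to be handled before encoding.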

4. Encoding binary variables - scikit-learn

We can also encode binary variables in scikit-learn using LabelEncoder. It's useful to know both methods if, for example, we're implementing encoding as part of a scikit-learn workflow that strings together different steps of the machine learning process. Keep in mind that scikit-learn documents LabelEncoder as a transformer for target labels, so for feature columns inside a pipeline, a feature encoder such as OrdinalEncoder or OneHotEncoder is the usual choice. Creating a LabelEncoder object also allows us to reuse this encoding on other data, such as new data or a test set. To encode values in scikit-learn, we'll need to instantiate the LabelEncoder transformer. We can use the fit_transform method to both fit the encoder to the data and transform the column. Printing out both the subscribed column and the new column, we can see that the y's and n's have been encoded to 1s and 0s.
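The scikit-learn version might look like the sketch below. The sample rows and the names users, le, and sub_enc_le are assumptions made for illustration; LabelEncoder, fit_transform, transform, and classes_ are part of scikit-learn.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample data
users = pd.DataFrame({"subscribed": ["y", "n", "n", "y", "y"]})

# Instantiate the encoder, then fit it and transform the column in one step
le = LabelEncoder()
users["sub_enc_le"] = le.fit_transform(users["subscribed"])

# The learned classes are stored in sorted order, so "n" maps to 0 and "y" to 1
print(le.classes_)
print(users[["subscribed", "sub_enc_le"]])

# The fitted encoder can be reused on new data with transform
new_values = le.transform(["n", "y", "y"])
print(new_values)
```

Because the mapping follows the sorted order of the classes, y becoming 1 here is a consequence of the labels used, not a rule about which value counts as positive.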

5. One-hot encoding

One-hot encoding encodes categorical variables into 1s and 0s when there are more than two values to encode. It works by looking at the entire list of unique values in a column, transforming each value into an array, and placing a 1 in the position that corresponds to that value. For example, in the fav_color column, we have three values: blue, green, and orange. If we were to encode these colors with 0s and 1s based on this list, blue would have a 1 in the first position followed by two 0s, green would have a 1 in the second position, and orange would have a 1 in the last position. So an encoded column would look something like this.
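To make the mechanism concrete, here is a small hand-rolled sketch in plain Python. The helper one_hot and the sample list colors are hypothetical and exist only to illustrate the idea; in practice a library function would do this work.

```python
# Hypothetical sample values for the fav_color column
colors = ["blue", "green", "orange", "green", "blue"]

# The entire list of unique values, in a fixed (sorted) order
categories = sorted(set(colors))  # ['blue', 'green', 'orange']


def one_hot(value, categories):
    """Return a list with a 1 in the position of `value` and 0s elsewhere."""
    return [1 if value == category else 0 for category in categories]


# Encode every value in the column
encoded = [one_hot(color, categories) for color in colors]

for color, row in zip(colors, encoded):
    print(color, row)
```

Each encoded row contains exactly one 1, which is where the name one-hot comes from.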

6. One-hot encoding

We can use the pandas get_dummies function to directly encode categorical values in this way.
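A minimal sketch of get_dummies on the fav_color column follows. The sample rows and the names users and dummies are assumptions made for illustration. The dtype=int argument is passed because recent pandas versions return boolean columns by default, and the 1s and 0s described above require an integer type.

```python
import pandas as pd

# Hypothetical sample data
users = pd.DataFrame({"fav_color": ["blue", "green", "orange", "green", "blue"]})

# One column per unique value; dtype=int gives 1s and 0s instead of booleans
dummies = pd.get_dummies(users["fav_color"], dtype=int)

print(dummies)
```

The result has one column per color, and each row has a 1 only in the column matching that user's favorite color.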

7. Let's practice!

Now it's your turn to encode categorical values.