
One-hot encoding

1. One-hot encoding

Hello, and welcome to our final lesson. Here we give a short introduction to one-hot encoding.

2. Why not just label encoding?

In the last lesson, we learned about label encoding. Consider the mapping we used there. What do you notice? You might recall that the codes are assigned in alphabetical order, or in the order of the categories if the column is ordinal. If you use a column of these codes in a machine learning model, the algorithm might misinterpret their meaning. Remember, algorithms train on numbers, so a model can read the codes as quantities with a real order and distance between them. For example, diesel with a value of 0 might be given less weight than gasoline with a value of 3, even though the numbers only reflect alphabetical position. We need a better approach.
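To make the problem concrete, here is a minimal sketch of label encoding using a pandas categorical column. The fuel types below are hypothetical toy values, not the lesson's actual data, and the pandas category dtype is only one way to produce such codes.

```python
import pandas as pd

# Hypothetical fuel types for illustration only.
fuel = pd.Series(
    ["gasoline", "diesel", "gas", "electric", "diesel"], dtype="category"
)

# Categories are sorted alphabetically, so the integer codes follow that order.
mapping = {category: code for code, category in enumerate(fuel.cat.categories)}
print(mapping)  # {'diesel': 0, 'electric': 1, 'gas': 2, 'gasoline': 3}

# The codes are plain integers. A model can compare or average them,
# even though "gasoline > diesel" has no real meaning.
codes = fuel.cat.codes
print(codes.tolist())
```

Nothing in the codes tells a model that 3 is just a label, which is the motivation for one-hot encoding.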

3. One-hot encoding with pandas

In a previous exercise, we created a zero-one column for a single value of a single column. Fortunately, we can do this for all values of a single column, or even for all columns at one time, using the pandas function get_dummies. One-hot encoding is the process of creating dummy variables, hence the name get_dummies. This function has several inputs: data is the DataFrame we are using, columns is a list of the column names we want to encode, and prefix is a string that will be added to the beginning of the new column names. Let's look at a few examples in action.
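As a rough sketch of how these three inputs fit together, the call below encodes one column of a small, made-up DataFrame. The column names, values, and the prefix "paint" are hypothetical; dtype=int is added so the output shows 0 and 1 regardless of pandas version.

```python
import pandas as pd

# Hypothetical toy data standing in for the lesson's used cars DataFrame.
cars = pd.DataFrame({
    "color": ["black", "blue", "black"],
    "transmission": ["automatic", "mechanical", "mechanical"],
})

# data: the DataFrame to work on
# columns: the list of column names to encode
# prefix: text placed before each new column name
encoded = pd.get_dummies(data=cars, columns=["color"], prefix="paint", dtype=int)

print(encoded.columns.tolist())  # ['transmission', 'paint_black', 'paint_blue']
```

The transmission column is left untouched because it was not listed in columns.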

4. One-hot encoding on a DataFrame

Calling pd.get_dummies on a DataFrame applies one-hot encoding to all object and categorical columns. In this example, the DataFrame has just two columns after we subset it, odometer_value and color. Take a look at their current values.

5. One-hot encoding on a DataFrame continued

The syntax is pd.get_dummies with the DataFrame as the first parameter. used_cars_onehot will now have all object and categorical columns one-hot encoded. Any numeric columns will remain the same. Color had twelve unique values, so we now have one column per color. A 0 indicates that the car was not that color, while a 1 indicates that the car was that color. In pandas 2.0 and later, the new columns hold False and True by default unless you pass dtype=int. Our new DataFrame will have 13 total columns: one for the odometer value and 12 for the new color columns. The original color column is dropped.
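The whole-DataFrame call described above might look like the following sketch. The data is a made-up four-row stand-in with three colors rather than twelve, and the names odometer_value and used_cars_onehot assume the underscore spelling of the names spoken in the lesson.

```python
import pandas as pd

# Hypothetical subset with one numeric and one object column.
used_cars = pd.DataFrame({
    "odometer_value": [190000, 290000, 402000, 10000],
    "color": ["silver", "blue", "red", "blue"],
})

# With no columns argument, every object/categorical column is encoded.
# dtype=int gives 0/1 instead of the boolean default in recent pandas.
used_cars_onehot = pd.get_dummies(used_cars, dtype=int)

print(used_cars_onehot.columns.tolist())
# ['odometer_value', 'color_blue', 'color_red', 'color_silver']
```

Here the result has 1 + 3 columns, following the same arithmetic as the 1 + 12 in the lesson: the numeric column is kept and the original color column is replaced by one column per unique color.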

6. Specifying columns to use

It may be important to use get_dummies on only a subset of columns, as you may not want to encode all object or categorical variables. In this example, we are one-hot encoding the color column only. Since we are only encoding one column, we have decided to set the prefix to an empty string. All other columns in the used cars dataset will be left alone, but the color column will be converted to twelve columns, one for each color. Notice the new names for the columns: _black and _blue. The leading underscore is the separator that get_dummies still inserts after the empty prefix. In this example, we did not subset the used cars dataset. There are now 41 total columns: 29 carried over from the original dataset, and 12 for the new color columns.
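A minimal sketch of encoding a single column with an empty prefix is shown below. The three-row DataFrame and its manufacturer_name column are hypothetical stand-ins for the full used cars dataset.

```python
import pandas as pd

# Hypothetical stand-in for the full dataset: two columns we leave alone
# and one column we encode.
used_cars = pd.DataFrame({
    "manufacturer_name": ["Subaru", "Ford", "Subaru"],
    "odometer_value": [190000, 290000, 402000],
    "color": ["black", "blue", "black"],
})

# Only "color" is encoded; prefix="" leaves just the "_" separator
# in front of each category name.
used_cars_onehot = pd.get_dummies(
    used_cars, columns=["color"], prefix="", dtype=int
)

print(used_cars_onehot.columns.tolist())
# ['manufacturer_name', 'odometer_value', '_black', '_blue']
```

Note that manufacturer_name is still a text column afterward, because it was not named in columns.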

7. A few quick notes

There are a few things to consider when using one-hot encoding. First, if your columns have a lot of unique values, an equal number of new columns will be created. Training machine learning models on a lot of columns may lead to a problem known as overfitting, something we would like to avoid. Just look at what happens when we use the entire used cars dataset: we now have over 1,000 total columns. Second, NaN values do not get their own column. This is OK, though. If all created columns for a variable are 0, this indicates that the original value was missing. There is no need to have a column for missing values.
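Both notes can be checked on a small example. The sketch below uses a made-up color column with one missing value; counting unique values before encoding is a suggested habit here, not a step from the lesson.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value.
colors = pd.DataFrame({"color": ["black", np.nan, "blue", "black"]})

# nunique() ignores NaN, so this is the number of columns encoding will create.
n_new_columns = colors["color"].nunique()

onehot = pd.get_dummies(colors, dtype=int)

# Each row sums to 1 when a color is present and to 0 when it was missing.
row_sums = onehot.sum(axis=1)

print(n_new_columns)
print(onehot.columns.tolist())
print(row_sums.tolist())
```

Running nunique() on each object column of a large dataset before encoding gives an early warning about how many columns the result will have.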

8. One-hot encoding practice

Before you start preparing data for machine learning models, let's work on a couple of examples.