Label encoding

1. Label encoding

Pitfalls aside, let's assume we have prepared the data correctly. In this lesson, we will begin to look at a very useful technique for categorical columns: label encoding.

2. What is label encoding?

Label encoding is a technique that codes categorical values as integers. In Python, these codes often start at 0 and end at n - 1, where n is the number of categories. A code of -1 is often used to indicate a missing value. Label encoding is used to save memory and to simplify responses when using survey data. Although the codes created through label encoding can be used in machine learning models, this is not the best encoding method for machine learning, because integer codes suggest an order and spacing between categories that may not exist. We will cover a better technique in the next lesson.
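As a minimal sketch of the numbering described above, the snippet below label-encodes a small made-up Series (the colors are hypothetical and not part of the used cars dataset). It shows codes running from 0 to n - 1 and pandas assigning -1 to the missing entry.

```python
import pandas as pd

# Hypothetical toy data; None stands in for a missing response.
colors = pd.Series(["red", "blue", None, "green", "blue"], dtype="category")

# The categories are sorted alphabetically: blue=0, green=1, red=2.
codes = colors.cat.codes
print(codes.tolist())  # [2, 0, -1, 1, 0]
```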

3. Creating codes

To create an encoding, let's convert the manufacturer name column of the used cars dataset to a categorical column. We can get a label encoding by using .cat.codes, which will convert the values to integers. If the column is not ordinal (that is, we have not specified a category order ourselves), the codes will be assigned in alphabetical order. Here we make a new Series, called manufacturer code, that contains the integer values that correspond to the manufacturer names.
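The steps above can be sketched as follows. The DataFrame name used_cars, the column names manufacturer_name and manufacturer_code, and the five rows are assumptions made for illustration; the real dataset is much larger.

```python
import pandas as pd

# Hypothetical stand-in for the used cars dataset.
used_cars = pd.DataFrame(
    {"manufacturer_name": ["Subaru", "Chrysler", "Subaru", "Acura", "Ford"]}
)

# Convert the column to a categorical dtype.
used_cars["manufacturer_name"] = used_cars["manufacturer_name"].astype("category")

# .cat.codes returns one integer per row, assigned in alphabetical order
# of the categories because no explicit order was given.
used_cars["manufacturer_code"] = used_cars["manufacturer_name"].cat.codes

print(used_cars)
```

In this toy data, Subaru is the first row but the last name alphabetically, so it receives the highest code.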

4. Check output

Let's check the output. Subaru is the first manufacturer name in the dataset, but is the 46th name in alphabetical order, so it has been assigned a code of 45. Chrysler is the 9th name in alphabetical order, and has been given a code of 8.

5. Code books / data dictionaries

As mentioned at the start of the video, label encoding is often used in surveys. The responses and their corresponding codes are often kept in a code book or a data dictionary. Consider this variable from the American Housing Survey, which records whether a house was built in the last four years: a 1 represents YES and a 2 represents NO.

6. Creating a code book

If you do create a label encoding and save the new dataset, you will want to create a map from the new codes to the old values. This can be done by creating an object for the codes and an object for the categories. Python has a built-in function called zip that can be used to iterate through the entries of codes and categories one pair at a time. If we place the zip call inside a call to dict, each unique combination of code and category will be added as a key-value pair. Printing the name map reveals which code maps to which category. As stated earlier, Subaru was the first manufacturer name to show up in the dataset, but is 46th in alphabetical order.
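A minimal sketch of building such a map, again on a hypothetical five-row stand-in for the used cars data (the names used_cars, manufacturer_name, and name_map are assumptions):

```python
import pandas as pd

# Hypothetical stand-in for the used cars dataset.
used_cars = pd.DataFrame(
    {"manufacturer_name": ["Subaru", "Chrysler", "Subaru", "Acura", "Ford"]}
)
used_cars["manufacturer_name"] = used_cars["manufacturer_name"].astype("category")

# One object for the codes and one for the categories, row by row.
codes = used_cars["manufacturer_name"].cat.codes
categories = used_cars["manufacturer_name"]

# zip pairs each code with its category; dict keeps one entry per unique code.
name_map = dict(zip(codes, categories))

# Print the code book in code order.
for code in sorted(name_map):
    print(code, name_map[code])
```

Repeated rows, such as the second Subaru, simply overwrite the same key with the same value, so the dictionary ends up with one entry per category.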

7. Using a code book

We can then use the name map we just created to convert the codes back to their original categorical values. This happens a lot when using surveys, as responses are often stored as numbers to save on memory. We can convert the column back to the original categories using the .map method and specifying our name map. .map is similar to .replace: it replaces each value in the Series by looking it up among the keys of the name map and returning the corresponding value. .map is used in this context because we have a complete mapping. Every single value in the manufacturer code column should have a key in the name map dictionary; any value without a key would be turned into a missing value.
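As a sketch of this lookup, the snippet below applies a small hand-written code book to a Series of codes. The name_map contents and the codes are hypothetical. The last two lines illustrate why a complete mapping matters: a code with no key comes back as NaN.

```python
import pandas as pd

# Hypothetical code book and encoded column.
name_map = {0: "Acura", 1: "Chrysler", 2: "Ford", 3: "Subaru"}
manufacturer_code = pd.Series([3, 1, 3, 0, 2])

# Look up each code in the dictionary to recover the original names.
restored = manufacturer_code.map(name_map)
print(restored.tolist())  # ['Subaru', 'Chrysler', 'Subaru', 'Acura', 'Ford']

# A code that is missing from the code book (9 here) becomes NaN.
incomplete = pd.Series([3, 9]).map(name_map)
print(incomplete.isna().tolist())  # [False, True]
```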

8. Boolean coding

When creating a label encoding for a categorical column, it is common to create a Boolean code that represents a group of categories. For example, say we wanted to create a Boolean code for all cars that are vans. We have already seen how to find the cars with a body type that contains the letters v-a-n. We can use the NumPy function where to say that any time this statement is true, we want a value of 1, and any time it is false, we want a value of 0. Looking at the output, only about 4,400 of the 38,000 used cars have van in their body type name.
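A minimal sketch of this Boolean coding, assuming a hypothetical body_type column and a made-up handful of rows (the van_code column name is also an assumption):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the body type column of the used cars dataset.
used_cars = pd.DataFrame(
    {"body_type": ["sedan", "minivan", "van", "suv", "hatchback"]}
)

# True wherever the body type contains the literal substring "van".
is_van = used_cars["body_type"].str.contains("van", regex=False)

# np.where(condition, value_if_true, value_if_false)
used_cars["van_code"] = np.where(is_van, 1, 0)

print(used_cars["van_code"].tolist())  # [0, 1, 1, 0, 0]
print(used_cars["van_code"].value_counts())
```

The match is case-sensitive by default, so a value such as "Van" would not be flagged unless case=False is passed to str.contains.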

9. Encoding practice

Let's work through a couple of examples.
