Transforming categorical variables

1. Transforming categorical variables

Now that we know which variables in our dataset are categorical, we can start transforming them into numerical ones.

2. Types of categorical variables

To transform a categorical variable into a numeric one, we first have to understand its type. There are two types of categorical variables: ordinal and nominal. Ordinal variables have two or more categories that can be ranked or ordered. In our case that is the **salary** column, whose values have a logical order. The second type is nominal, where categories do not have any intrinsic or logical order. An example of this kind of variable in our dataset is the **department** column, as its values do not have any order or rank: the sales department is not higher than hr, or vice versa, and so on. Different transformation methods apply depending on which type of categorical variable you have.

3. Encoding categories (salary)

For ordinal variables, we can encode the categories by converting each of them into a corresponding numeric value. There are 3 steps to accomplish that task in Python.

- First, we have to tell Python that the salary column is actually categorical. This is done using a method called **astype()**, which takes the desired type of the variable.
- Then, once Python knows that it is a categorical variable, we have to specify the correct order of the categories, using the cat.reorder_categories() method. As you can see in the code, this method takes a list as input, in which the categories are given in the correct order.
- Last but not least, we have to use the cat.codes attribute to encode each category with a numeric value that follows our order.

Assigning the result back to the salary column overwrites its old values with the new numeric values, as presented in the table.
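The three steps above can be sketched as follows. This is a minimal example, not the course's exact code: the small DataFrame and the category labels "low", "medium" and "high" are assumptions made for illustration, since the actual salary values are not listed in this section.

```python
import pandas as pd

# Hypothetical sample data; the labels "low", "medium", "high" are assumed
data = pd.DataFrame({"salary": ["low", "high", "medium", "low"]})

# Step 1: tell pandas that the column is categorical
data["salary"] = data["salary"].astype("category")

# Step 2: provide the correct order of the categories
data["salary"] = data["salary"].cat.reorder_categories(["low", "medium", "high"])

# Step 3: replace each category with its integer code (its position in the order)
data["salary"] = data["salary"].cat.codes

print(data)
```

With this ordering, the lowest category receives code 0 and each following category receives the next integer.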

4. Getting dummies

The next categorical variable is nominal, as there is no order or rank between departments. This means that the encoding approach is no longer useful. In this case, the transformation should be accomplished through so-called dummy variables. Dummy variables are variables that take only two values: 0 or 1. Let's say an employee is from the technical department. If we have a separate column for each department, then this employee will have a value of 1 in the column for technical and 0 in the columns of all other departments. This means we will have to create a new DataFrame where each department is a separate column and each row is a separate employee, with a 1 in the column of their department and 0s in all other places. While the task may seem confusing, it is very easy from a technical perspective thanks to a very nice pandas function called get_dummies().
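A minimal sketch of this step is shown below. The sample rows and the variable name `departments` are assumptions made for illustration. The `dtype=int` argument is passed so that the result contains 0s and 1s, because recent pandas versions return boolean columns by default.

```python
import pandas as pd

# Hypothetical sample data with a nominal department column
data = pd.DataFrame({"department": ["technical", "sales", "hr", "sales"]})

# One column per department; each row has a 1 only in its own department
departments = pd.get_dummies(data["department"], dtype=int)

print(departments)
```

Each row of the resulting DataFrame sums to 1, since every employee belongs to exactly one department.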

5. Dummy trap

When dealing with dummy variables, one should be cautious of a phenomenon known as the dummy trap. This is the situation in which different dummy variables convey the same information. In this example, the sample employee is from the technical department, so that is the only column with a value of 1 in the first table. In the second table, the last column is dropped, but we can still tell that the employee is from the technical department because all the other departments have a value of 0. For that reason, whenever dummies are created in a situation like this, one of them can be dropped, as its information is already contained in the others.
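One way to avoid the trap is to drop a single dummy column after creating the dummies. The sketch below uses assumed sample data and drops the technical column to mirror the example above; which column is dropped is a free choice.

```python
import pandas as pd

# Hypothetical sample data with a nominal department column
data = pd.DataFrame({"department": ["technical", "sales", "hr", "sales"]})
departments = pd.get_dummies(data["department"], dtype=int)

# Drop one dummy column; its information is implied by the remaining ones
departments = departments.drop("technical", axis=1)

# The first employee (technical) now has 0 in every remaining column
print(departments)
```

As an alternative, get_dummies() accepts `drop_first=True`, which removes the first category's column instead of one chosen by name.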

6. Let's practice!

Okay, time to put this into practice.