Introduction to Decision Tree classification

1. Introduction to Decision Tree classification

In this lecture we will start building the predictive algorithm. The goal is an algorithm that learns from our historical data which variables affect the decision to leave the company, and uses that information to predict turnover. Because our target, turnover, takes only two values, 1 and 0, this problem is called binary classification.
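To make the idea concrete, here is a minimal sketch of what such a binary target looks like in Python; the DataFrame and the column name churn are made up for illustration:

    # Toy stand-in for the HR dataset: 1 = left the company, 0 = stayed
    import pandas as pd

    data = pd.DataFrame({"churn": [1, 0, 0, 1, 0]})
    print(data["churn"].unique())        # [1 0] -> exactly two classes
    print(data["churn"].value_counts())  # how many leavers vs. stayers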

2. Classification in Python

There are many machine learning algorithms one can use to address a binary classification problem such as predicting employee turnover. Each has its own pros and cons, and business cases where it works best. The algorithm we will use, the Decision Tree, has proven quite popular in HR analytics for two main reasons: first, it provides accurate predictions, and second, it helps us understand the factors driving the decision to leave the company.
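As a preview, a Decision Tree classifier can be fitted in a few lines with scikit-learn; the tiny dataset below is invented purely for illustration:

    from sklearn.tree import DecisionTreeClassifier

    # Each row: [satisfaction, high_salary]; target: 1 = leaver, 0 = stayer
    X = [[0.9, 1], [0.2, 0], [0.8, 0], [0.1, 1]]
    y = [0, 1, 0, 1]

    model = DecisionTreeClassifier(random_state=42)
    model.fit(X, y)
    print(model.predict([[0.3, 0]]))  # predicted class for a new employee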

3. Decision Tree Classification

The picture you see now is a visualization of a small sample Decision Tree for employee turnover. The algorithm's appearance is the reason it is called a Decision Tree. Let's go step by step over the tree to understand the classification process. The tree starts from its root node, which splits on the variable Satisfaction: for a given employee, we check whether the satisfaction level was higher than 0.5 or not. If it was, we go to the right branch of the tree; otherwise, we move to the left one. If we moved to the right, the next question according to the tree is whether Salary is high or not. As you can see, if Salary is high, we reach one of the final nodes, or leaves, of the tree, where the output is that the employee will not churn. Thus we have a decision path: employees with a high satisfaction level and a high salary do not churn. Similarly, employees with a low satisfaction level who spent, say, three years with the company do churn, as shown by the last leaf of the leftmost branch of the tree. Once we have this tree, we can easily predict whether a given employee will churn or not, and also understand which variables drive the churn decision.
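If you want to read such decision paths directly from a fitted tree, scikit-learn's export_text prints one line per split; assuming the toy model from the earlier sketch, the output looks roughly like this:

    from sklearn.tree import export_text

    print(export_text(model, feature_names=["satisfaction", "high_salary"]))
    # |--- satisfaction <= 0.50
    # |   |--- class: 1   (low satisfaction -> predicted to churn)
    # |--- satisfaction >  0.50
    # |   |--- class: 0   (high satisfaction -> predicted to stay)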

4. Splitting rule

Let's now look briefly at the intuition used to split the tree. In general, the Decision Tree algorithm aims to make the samples in the final leaves as pure as possible, meaning each leaf should contain mostly one class. Mathematically, two different rules are popular for this task: Gini and Entropy. The objective is the same in both cases: we minimize the Gini impurity or the Entropy, and both result in purer samples in the final nodes. Since neither method has been theoretically proven to dominate the other, we will use Gini, as it is faster to compute.
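In scikit-learn the splitting rule is selected with the criterion argument; a quick sketch of both options, with the impurity formulas as comments:

    from sklearn.tree import DecisionTreeClassifier

    # Gini impurity: 1 - sum(p_i ** 2) over the class proportions p_i
    gini_tree = DecisionTreeClassifier(criterion="gini")  # the default

    # Entropy: -sum(p_i * log2(p_i)); needs logarithms, so it is slower
    entropy_tree = DecisionTreeClassifier(criterion="entropy")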

5. Decision Tree splitting: hypothetical example

Let's discuss a hypothetical example. Assume we have a dataset of 100 people, 40 of them leavers and 60 stayers. Now let's divide them based on whether the satisfaction level is higher than 0.8 or not. If yes, suppose we end up with 50 people on the left branch, all stayers. The right branch then includes the remaining 10 stayers and the 40 leavers. As you can see, this hypothetical split results in a tremendously decreased Gini impurity: from 0.48 in the parent node to 0 in the left branch and 0.32 in the right one. As a result, we have purer samples, especially in the left branch, where we have only stayers, which helps us make more accurate predictions.
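We can verify these numbers with a few lines of Python; the gini helper below simply implements the formula 1 - sum(p_i ** 2) on raw class counts:

    def gini(counts):
        """Gini impurity from a list of class counts."""
        total = sum(counts)
        return 1 - sum((c / total) ** 2 for c in counts)

    print(gini([60, 40]))  # parent node: 60 stayers, 40 leavers  -> 0.48
    print(gini([50, 0]))   # left branch: all 50 are stayers      -> 0.0
    print(gini([10, 40]))  # right branch: 10 stayers, 40 leavers -> 0.32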

6. Let's practice!

Good, let's now practice the theory before moving on to the analytics.