1. Splitting the data
Hello and welcome to Chapter 2. By now we know how to transform HR data and make it ready for predictive analytics. Let's now concentrate on the predictive component.
2. Target and features
Our goal in this course is to predict employee turnover using the data that we have on employees. In business analytics and data science terminology, the variable that one aims to predict is known as the target, while everything else that is used for prediction is called the features. In other words, we will be using the features to predict the target.
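Separating the target from the features usually comes down to selecting one column and dropping it from the rest. Here is a minimal sketch using pandas; the column names and values are illustrative, not the course's actual dataset:

```python
import pandas as pd

# Hypothetical HR data; column names are illustrative only
data = pd.DataFrame({
    "satisfaction": [0.38, 0.80, 0.11, 0.72],
    "evaluation":   [0.53, 0.86, 0.88, 0.87],
    "turnover":     [1, 0, 1, 0],
})

# The target is the single column we want to predict
target = data["turnover"]

# The features are all remaining columns
features = data.drop("turnover", axis=1)
```

Keeping the target out of the features is essential: a model trained on a feature set that still contains the target would "predict" perfectly but learn nothing useful.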
3. Train/test split
To make accurate predictions and build an algorithm that is useful in practice, it is common in analytics to split the data into two components: train and test. The train component is used to conduct calculations, run optimizations, and develop the algorithm, while the remaining test component is used to validate it. For that reason, once our data is separated into target and features, the next step is to split both of them into train and test components.
One of the most popular Python libraries, widely used by data scientists and business analysts, is called sklearn. In sklearn, there is almost always a built-in function for common analytics tasks, including train/test splitting.
As you can see from the code, the function generates 4 outputs. This happens because both the target and the features are split between train and test, so we end up with train and test components for the target, and likewise for the features.
Last but not least, the function takes a test_size argument, which is 0.25 in our example. This argument tells sklearn to randomly choose 25% of the data and set it aside as test, while the remaining 75% is kept for training. In general, when you have a very big dataset with millions of observations, around 2-3% for test might be enough. But because our datasets in HR are usually not that big, 25% for test is a good practice.
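The split described above can be sketched with sklearn's train_test_split. The small DataFrame here is illustrative; with 8 rows and test_size=0.25, two rows land in the test set:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical HR data; values are illustrative only
data = pd.DataFrame({
    "satisfaction": [0.38, 0.80, 0.11, 0.72, 0.37, 0.41, 0.10, 0.92],
    "evaluation":   [0.53, 0.86, 0.88, 0.87, 0.52, 0.50, 0.77, 0.85],
    "turnover":     [1, 0, 1, 0, 1, 1, 1, 0],
})
target = data["turnover"]
features = data.drop("turnover", axis=1)

# The function returns 4 outputs: train/test for features, train/test for target.
# test_size=0.25 reserves a random 25% of rows for validation;
# random_state makes the random choice reproducible.
features_train, features_test, target_train, target_test = train_test_split(
    features, target, test_size=0.25, random_state=42
)
```

Note that the outputs come back in the same order as the inputs: each array passed in yields its train part first, then its test part.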
4. Overfitting
To better understand the reasoning behind the train/test split, let's briefly cover the concept of overfitting. Overfitting is one of the most common problems in analytics. Our first objective is to have an accurate model that can help us make accurate predictions and decisions based on them. Yet a model that is accurate on one dataset might not be as accurate on another. So our second, no less important, objective is to build a model that is generalizable, in other words, one that works well not only on our current dataset but also on possible future datasets. Overfitting happens when the model works well on the dataset it was developed on, but is not useful outside of it. So we split the data into train and test components, develop the model on train, and then validate it on test to make sure our model was not overfitting the training data.
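Overfitting is easy to demonstrate. The sketch below, which is not part of the course, fits an unconstrained decision tree on purely random labels: the tree memorizes the training data perfectly, yet on the test set it can do no better than guessing, which is exactly what the train/test split is designed to reveal:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with random labels: there is no real pattern to learn
rng = np.random.RandomState(0)
X = rng.rand(200, 5)
y = rng.randint(0, 2, 200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# An unconstrained tree grows until it memorizes every training row
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)

train_acc = tree.score(X_train, y_train)  # perfect on the data it memorized
test_acc = tree.score(X_test, y_test)     # near chance on unseen data
```

Had we evaluated only on the training data, the model would have looked flawless; the test set exposes that it generalizes no better than a coin flip.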
5. Let's practice!
We will return to the concept of overfitting in more detail, but until then, let's practice splitting the data.