
Data splitting

1. Training and testing datasets: splitting data

So far you have built some nice-looking plots, derived new variables, and used information value to determine whether your independent variables have any predictive power. We will now start working on the "real" objective of this course: building a prediction model to determine whether an employee will leave the organization.

2. What is a model?

The basic idea is that you feed some data into an algorithm (also called a model), which learns to find patterns in the input data and gives you predictions as the output.

3. Why split data into train and test?

Once an algorithm has learned the underlying patterns of the input data, you don't yet know how well it will perform on data it has not seen, and performance on unseen data is the true measure of model quality. You generally have access to limited data, so it's best to withhold some of it for testing purposes. You can do this by splitting your data into two parts, training and testing. You train (or build) a model using the training data and then evaluate how well your model performs using the testing data.

4. Splitting data with caret

We will use the createDataPartition() function from the caret package to split the data into training and testing datasets. Since splitting data this way is inherently a random process, let's set a seed to ensure we always get the same results. The first argument to createDataPartition() is our target variable, turnover. Then we set p to 0.5, indicating that we want 50% of the data in the training set. Finally, we set the list argument to FALSE so the result is a matrix of row numbers for the training set. You can then use this result, index_train, to create the training and testing data frames. The split is stratified on the target variable, i.e., the training and testing datasets have approximately the same proportion of active and inactive employees.
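The steps just described can be sketched in R as follows. This is a minimal illustration, not the course's own code: the data frame emp, its tenure column, the seed values, and the names train_set and test_set are all hypothetical stand-ins, since the actual dataset is not shown here. Only createDataPartition(), its p and list arguments, the turnover target, and index_train come from the text.

```r
library(caret)

# Hypothetical stand-in for the employee data; the real dataset and its
# columns (other than turnover) are assumptions made for illustration.
set.seed(1)
emp <- data.frame(
  turnover = factor(rep(c(0, 1), times = c(800, 200))),
  tenure   = round(runif(1000, 0, 10), 1)
)

# Fix the seed so the random split is reproducible (seed value is arbitrary).
set.seed(567)

# Row numbers for the training set, sampled within each level of turnover.
# list = FALSE returns a one-column matrix instead of a list.
index_train <- createDataPartition(emp$turnover, p = 0.5, list = FALSE)

# Rows in index_train form the training set; all remaining rows form the test set.
train_set <- emp[index_train, ]
test_set  <- emp[-index_train, ]

# Compare the class proportions of turnover in the two sets.
prop.table(table(train_set$turnover))
prop.table(table(test_set$turnover))
```

Printing the two proportion tables side by side is a quick way to check that the share of active and inactive employees is close in both datasets.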

5. Let's practice!

Now go ahead and split the data into training and testing datasets.