Building your final model

1. Final steps to nirvana

Great job detecting and removing multicollinear variables.

2. Build your final model

In the following exercises, you will build the final logistic regression model using the train_set_final dataset, which contains all the relevant independent variables. After you fit the final model, you will use it to predict the probability of turnover.
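As a rough sketch of the fitting step, the snippet below simulates a small stand-in for train_set_final and fits a logistic regression with glm(). The simulated columns (tenure, percent_hike), the outcome name turnover, and the object name final_log are assumptions made for illustration; the course's actual dataset and variable names may differ.

```r
set.seed(42)
n <- 500

# Hypothetical stand-in for train_set_final; column names are illustrative only
train_set_final <- data.frame(
  tenure       = runif(n, 0, 10),
  percent_hike = runif(n, 0, 20)
)

# Simulate a binary outcome (1 = left, 0 = stayed) from assumed log-odds
log_odds <- 1 - 0.3 * train_set_final$tenure - 0.1 * train_set_final$percent_hike
train_set_final$turnover <- rbinom(n, 1, plogis(log_odds))

# Fit a logistic regression of turnover on all remaining columns
final_log <- glm(turnover ~ ., family = "binomial", data = train_set_final)

# Inspect the estimated coefficients
summary(final_log)
```

The formula turnover ~ . uses every other column as a predictor, which only makes sense once the dataset has been reduced to the variables you actually want in the model.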

3. Predicting probability of turnover

You can use the predict() function to find the probability of a data point belonging to the target class. Let's first predict turnover on the training data; this is known as in-sample prediction. Pass the model object as the first argument and set the newdata argument to the training data. Also make sure you set the type argument to "response" so that you get a vector of probabilities rather than values on the log-odds scale. Here we print the predicted probability of turnover for two arbitrarily chosen employees. As you can see, the probability of turnover for employee 205 is 6%, whereas for employee 645 it is 99.99%.
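A minimal sketch of in-sample prediction follows, again on simulated data. The data, the object names final_log and prediction_train, and the single predictor tenure are assumptions; only predict(), newdata, and type = "response" come from the lesson. The snippet also shows what the default type returns, to illustrate why the type argument matters.

```r
set.seed(1)
n <- 1000

# Simulated stand-in for the training data (hypothetical column names)
train_set_final <- data.frame(tenure = runif(n, 0, 10))
train_set_final$turnover <- rbinom(n, 1, plogis(2 - 0.8 * train_set_final$tenure))

final_log <- glm(turnover ~ ., family = "binomial", data = train_set_final)

# In-sample prediction: type = "response" returns probabilities
prediction_train <- predict(final_log, newdata = train_set_final, type = "response")

# The default (type = "link") returns log-odds instead of probabilities
log_odds_train <- predict(final_log, newdata = train_set_final)

# Predicted probabilities for two arbitrary rows
prediction_train[c(205, 645)]

# Applying the logistic function to the log-odds recovers the probabilities
all.equal(plogis(log_odds_train), prediction_train, check.attributes = FALSE)
```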

4. Plot probability range: training dataset

It's good practice to explore the range of predicted probabilities for all observations in the training dataset, to check that they span most of the interval between 0 and 1. You can do this with a histogram. To quickly plot one, use the hist() function from base R, as shown here. A small range means that the predictions for the cases do not lie far apart, and therefore the model might not be very good at discriminating which employees are going to leave the organization. You can also see that the model predicts a turnover probability of almost zero for a large chunk of employees, which matches our dataset. Thus, we feel confident about testing this model with the testing dataframe we created in the first lesson.
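The check described above might look like the following sketch. The data are simulated, and the names final_log and prediction_train are assumptions; hist() and range() are base R functions.

```r
set.seed(7)
n <- 1000

# Simulated stand-in for the training data (hypothetical column names)
train_set_final <- data.frame(tenure = runif(n, 0, 10))
train_set_final$turnover <- rbinom(n, 1, plogis(2 - 0.8 * train_set_final$tenure))

final_log <- glm(turnover ~ ., family = "binomial", data = train_set_final)
prediction_train <- predict(final_log, newdata = train_set_final, type = "response")

# Smallest and largest predicted probability
range(prediction_train)

# Histogram of predicted probabilities with the x-axis fixed to [0, 1],
# so a narrow spread is easy to spot
h <- hist(prediction_train, breaks = 20, xlim = c(0, 1),
          main = "Predicted turnover probability (training)",
          xlab = "Predicted probability")
```

Fixing xlim to the full interval is a design choice here: with the default axis limits, a histogram of tightly clustered predictions can look deceptively wide.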

5. Predicting probability: testing dataset

To test the model on the previously unseen test dataset, all you need to do is change the dataset passed to newdata. This is called out-of-sample prediction. Its benefit is that it shows how accurately your model will predict on new, unseen data.
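To illustrate out-of-sample prediction, the sketch below splits simulated data into training and testing parts and passes the testing part to newdata. The 70/30 split, the name test_set, and the simulated columns are assumptions; the lesson's own testing dataframe was created earlier in the course and may have been built differently.

```r
set.seed(123)
n <- 1000

# Simulated employee data (hypothetical column names)
emp <- data.frame(tenure = runif(n, 0, 10))
emp$turnover <- rbinom(n, 1, plogis(2 - 0.8 * emp$tenure))

# Assumed 70/30 train/test split
train_idx <- sample(n, size = 0.7 * n)
train_set_final <- emp[train_idx, ]
test_set <- emp[-train_idx, ]

# Fit on the training data only
final_log <- glm(turnover ~ ., family = "binomial", data = train_set_final)

# Out-of-sample prediction: same call, different newdata
prediction_test <- predict(final_log, newdata = test_set, type = "response")
head(prediction_test)
```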

6. Plot probability range: testing dataset

Here you can see that the distribution of predicted probabilities for the testing dataset is very similar to that of the training dataset. This suggests that your model generalizes well to unseen data.
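One way to make the comparison between the two distributions concrete is to put their quantiles side by side and plot both histograms on the same axis range. This is a sketch on simulated data; the split, the column names, and the object names are assumptions rather than the course's own code.

```r
set.seed(2024)
n <- 1000

# Simulated employee data with an assumed 70/30 split
emp <- data.frame(tenure = runif(n, 0, 10))
emp$turnover <- rbinom(n, 1, plogis(2 - 0.8 * emp$tenure))
train_idx <- sample(n, size = 0.7 * n)
train_set_final <- emp[train_idx, ]
test_set <- emp[-train_idx, ]

final_log <- glm(turnover ~ ., family = "binomial", data = train_set_final)
prediction_train <- predict(final_log, newdata = train_set_final, type = "response")
prediction_test  <- predict(final_log, newdata = test_set, type = "response")

# Quantiles of the predicted probabilities, one row per dataset
probs <- c(0, 0.25, 0.5, 0.75, 1)
comparison <- rbind(train = quantile(prediction_train, probs),
                    test  = quantile(prediction_test, probs))
comparison

# Side-by-side histograms on the same x-axis range
old_par <- par(mfrow = c(1, 2))
hist(prediction_train, xlim = c(0, 1), main = "Training", xlab = "Predicted probability")
hist(prediction_test,  xlim = c(0, 1), main = "Testing",  xlab = "Predicted probability")
par(old_par)
```

Similar distributions are only a first sanity check; they do not by themselves compare the predictions with the actual outcomes.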

7. Let's practice!

Now it's your turn to build the final model and calculate the probability of turnover.
