HR Analytics: Predicting Employee Churn in Python

Exercise

Cross-validation using sklearn

As explained in Chapter 2, overfitting the dataset is a common problem in analytics. It happens when a model has learned the data too closely: it performs very well on the dataset it was trained on, but fails to generalize outside of it.

While the train/test split technique you learned in Chapter 2 ensures that the model does not overfit the training set, hyperparameter tuning may result in overfitting the test set, since it consists of tuning the model to get the best prediction results on that set. Therefore, it is recommended to validate the model on several different test sets. K-fold cross-validation allows us to achieve this:

  • it splits the dataset into k folds, using each fold in turn as the test set and the remaining folds as the training set
  • it fits the model, makes predictions, and calculates a score (you can specify whether you want accuracy, precision, recall…)
  • it repeats the process k times in total
  • it outputs the average of the k scores
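The steps above can be sketched with scikit-learn's cross_val_score(). Note that the dataset and model below are stand-ins (a synthetic classification problem and logistic regression), not the course's HR dataset or its tuned tree model:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: 200 samples, 5 features
X, y = make_classification(n_samples=200, n_features=5, random_state=42)

model = LogisticRegression()

# cv=10 requests 10 folds; each fold serves once as the test set
scores = cross_val_score(model, X, y, cv=10)

print(scores)          # one accuracy score per fold
print(scores.mean())   # the averaged cross-validation score
```

By default cross_val_score uses the estimator's default scorer (accuracy for classifiers); a different metric can be requested with the scoring parameter, e.g. scoring="precision".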

In this exercise, you will use cross-validation on the dataset and evaluate the results with the cross_val_score() function.

Instructions

100 XP
  • Import the function for implementing cross-validation, cross_val_score(), from the module sklearn.model_selection.
  • Print the cross-validation score for your model, specifying 10 folds with the cv parameter.