Cross-validation using sklearn

As explained in Chapter 2, overfitting the dataset is a common problem in analytics. This happens when a model has learned the data too closely: it performs very well on the dataset it was trained on, but fails to generalize outside of it.

While the train/test split technique you learned in Chapter 2 ensures that the model does not overfit the training set, hyperparameter tuning may result in overfitting the test set, since it consists of tuning the model to get the best prediction results on that set. It is therefore recommended to validate the model on several different test sets. K-fold cross-validation allows us to achieve this:

  • it splits the dataset into k folds, and in each iteration uses one fold as the testing set and the remaining folds as the training set
  • it fits the model, makes predictions and calculates a score (you can specify whether you want accuracy, precision, recall…)
  • it repeats the process k times in total, once per fold
  • it outputs the average of the k scores (see the sketch after this list)
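
For intuition, here is a minimal sketch of that procedure implemented by hand with scikit-learn's KFold. The synthetic dataset and the decision tree model are assumptions made purely for illustration; the cross_val_score() function you will use in the exercise wraps these steps for you.

# A sketch of k-fold cross-validation done manually
# (illustrative assumptions: synthetic data and a decision tree classifier)
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

features, target = make_classification(n_samples=500, random_state=42)
model = DecisionTreeClassifier(random_state=42)

kf = KFold(n_splits=10)  # k = 10 folds
scores = []
for train_idx, test_idx in kf.split(features):
    # fit on the training folds, score on the held-out fold
    model.fit(features[train_idx], target[train_idx])
    scores.append(model.score(features[test_idx], target[test_idx]))

print(np.mean(scores))  # average of the k scores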

In this exercise, you will use cross-validation on the dataset and evaluate the results with the cross_val_score() function.

This exercise is part of the course HR Analytics: Predicting Employee Churn in Python.

Exercise instructions

  • Import the function for implementing cross-validation, cross_val_score(), from the module sklearn.model_selection.
  • Print the cross-validation scores for your model, specifying 10 folds with the cv parameter.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Import the function for implementing cross-validation
from sklearn.model_selection import ____

# Use that function to print the cross-validation scores for 10 folds
print(____(model, features, target, ____=10))
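
One way the completed code could look, assuming model, features, and target are the objects defined earlier in the course (this is a sketch of a possible solution, not the official answer):

# Import the function for implementing cross-validation
from sklearn.model_selection import cross_val_score

# Print the cross-validation scores for 10 folds
print(cross_val_score(model, features, target, cv=10))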