Cross-validation using sklearn
As explained in Chapter 2, overfitting is a common problem in analytics. It happens when a model has learned the training data too closely: it performs very well on the dataset it was trained on, but fails to generalize outside of it.
While the train/test split technique you learned in Chapter 2 ensures that the model does not overfit the training set, hyperparameter tuning may result in overfitting the test set, since it consists of tuning the model to get the best prediction results on that set. Therefore, it is recommended to validate the model on several different test sets. K-fold cross-validation allows us to achieve this (a code sketch follows the list):
- it splits the dataset into k folds, holding out one fold as the testing set and using the remaining folds as the training set
- it fits the model, makes predictions and calculates a score (you can specify whether you want accuracy, precision, recall…)
- it repeats the process k times in total, so that each fold serves as the testing set once
- it outputs the average of the k scores
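As a minimal sketch of this workflow, the example below runs 10-fold cross-validation with scikit-learn. The DecisionTreeClassifier and the synthetic dataset are illustrative assumptions, not the course's HR churn data or model.
# Minimal sketch: 10-fold cross-validation with scikit-learn
# (synthetic data and classifier chosen for illustration only)
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Create a small synthetic classification problem
features, target = make_classification(n_samples=500, n_features=10, random_state=42)

# The model to evaluate
model = DecisionTreeClassifier(random_state=42)

# Compute the score on each of the 10 folds, then average them
scores = cross_val_score(model, features, target, cv=10)
print(scores)
print(scores.mean())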
In this exercise, you will use cross-validation on our dataset and evaluate the results with the cross_val_score function.
This exercise is part of the course HR Analytics: Predicting Employee Churn in Python.
Exercise instructions
- Import the function for implementing cross-validation, cross_val_score(), from the module sklearn.model_selection.
- Print the cross-validation score for your model, specifying 10 folds with the cv parameter.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Import the function for implementing cross validation
from sklearn.model_selection import ____
# Use that function to print the cross validation score for 10 folds
print(____(model, features, target, ____=10))
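For reference, one way the completed code might look, assuming model, features and target are already defined in the exercise environment as in the course:
# Import the function for implementing cross validation
from sklearn.model_selection import cross_val_score

# Print the cross validation scores across 10 folds
# (model, features and target are assumed to be predefined)
print(cross_val_score(model, features, target, cv=10))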