1. sklearn's cross_val_score()

Hello again. Next, we are going to discuss cross-validation in scikit-learn.

2. cross_val_score()

We have seen that KFold() is a great way to create indices that we can use for cross-validation. If you just want to jump straight into cross-validation and don't want to manage the indices yourself, you can use scikit-learn's cross_val_score() function. This function takes four key parameters. First, we have the estimator, or the specific model that you want to use. In this example, we have a RandomForestClassifier() with the default model settings. Next, we use X to specify the complete training dataset and y to specify the response values. Lastly, the parameter cv allows us to specify the number of cross-validation splits (or folds). In this example, we have set cv to 5, which performs 5-fold cross-validation. By default, cross_val_score() uses a default scoring function for whichever model you have specified. For example, if you have a RandomForestClassifier as the estimator, the default scoring function is the mean overall accuracy. For most regression models, it will return the R-squared value.
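A minimal sketch of this call, using the iris dataset as a stand-in for the course's X and y (the dataset and random_state are assumptions for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# A small example dataset standing in for the course's X and y
X, y = load_iris(return_X_y=True)

# Estimator with default model settings (random_state fixed for repeatability)
rfc = RandomForestClassifier(random_state=1111)

# cv=5 performs 5-fold cross-validation; for a classifier, the default
# scoring function is the mean accuracy on each validation fold
scores = cross_val_score(estimator=rfc, X=X, y=y, cv=5)
print(scores)  # one accuracy value per fold
```

The returned array has one score per fold, which you can then average or inspect individually.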

3. Using scoring and make_scorer

If you want to use a different scoring function, you can create a scorer with the make_scorer() method, specifying the scoring metric that you want to use. Here we create a scorer for the mean_absolute_error() function by calling make_scorer() on scikit-learn's method for calculating the mean absolute error. Finally, we set the scoring parameter equal to the newly created mae_scorer inside cross_val_score().
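This step might look like the following sketch; the regression data here is generated with make_regression() purely for illustration, as the course's actual dataset is not shown:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import make_scorer, mean_absolute_error
from sklearn.model_selection import cross_val_score

# Wrap scikit-learn's MAE function in a scorer cross_val_score() can use
mae_scorer = make_scorer(mean_absolute_error)

# Hypothetical regression data standing in for the course's X and y
X, y = make_regression(n_samples=100, noise=5, random_state=1111)

rfr = RandomForestRegressor(n_estimators=25, random_state=1111)

# Pass the new scorer via the scoring parameter
scores = cross_val_score(rfr, X, y, cv=5, scoring=mae_scorer)
print(scores)  # one mean absolute error per fold (lower is better)
```

Note that because this scorer reports an error, smaller values are better, unlike accuracy or R-squared.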

4. Full example

Let's run through a full example of using scikit-learn's cross_val_score() for a regression model. The first step is to load all of the necessary methods. We load the model, cross_val_score(), and both the make_scorer() and mean_squared_error() methods. Next, we specify the regression model we want to use, with the specific parameters, as well as create the scorer that should be used when running the regression model. Finally, we call cross_val_score() on the estimator, rfr, the dataset X, the response values y, and set scoring equal to the scorer we generated. In this example, we set cv to 5 to complete 5-fold cross-validation.
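The full example described above can be sketched as follows; the dataset and the model's specific parameter values are assumptions, since the transcript does not show them:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import make_scorer, mean_squared_error
from sklearn.model_selection import cross_val_score

# Hypothetical data standing in for the course's X and y
X, y = make_regression(n_samples=200, noise=10, random_state=1111)

# Regression model with specific (illustrative) parameters
rfr = RandomForestRegressor(n_estimators=25, random_state=1111)

# Create the scorer that should be used when scoring each fold
mse = make_scorer(mean_squared_error)

# cv=5 completes 5-fold cross-validation
cv_results = cross_val_score(rfr, X, y, cv=5, scoring=mse)
print(cv_results)  # one mean squared error per fold
```

Each entry in cv_results is the mean squared error of one validation fold.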

5. Accessing the results

Let's look at the results. Notice how varied the mean squared errors are. The lowest was almost 86, while the highest was well over 200. If we had chosen a single 80/20 split of the data at random, we may have reported an error as low as 86, or as high as 223. When we use cross-validation, we usually report the mean of the errors. In this case, it was 150. This is a much more realistic estimate of the out-of-sample error that we can expect to see on new data. Eighty-six was probably far too low an error, while 223 was far too high. Finally, we can look at the standard deviation to see how varied the five results were. The smaller the standard deviation, the tighter the five fold errors were. This indicates that the actual error on new data will probably match the mean of the cross-validation scores fairly well.
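To make this concrete, here is a sketch of summarizing such an array of fold errors; the exact fold values are invented so that their mean matches the 150 discussed above:

```python
import numpy as np

# Hypothetical fold errors like those discussed in the slide
cv_results = np.array([130.2, 86.0, 223.0, 140.5, 170.3])

print(cv_results.min())   # best (lowest) fold error, 86.0 here
print(cv_results.max())   # worst (highest) fold error, 223.0 here
print(cv_results.mean())  # the value we report, approximately 150.0 here
print(cv_results.std())   # spread of the fold errors
```

A small standard deviation relative to the mean suggests the estimate is stable across folds.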

6. Let's practice!

Let's now use cross_val_score() to perform cross-validation.