Exercise

Cross-validation helps improve your model's score on an out-of-sample data set.

So far, we have seen methods that can improve the accuracy of a model. But a higher-accuracy model does not necessarily perform better on unseen data points; sometimes the improvement in a model's accuracy is due to over-fitting.

This is where cross-validation helps. The idea is to leave out a sample on which the model is not trained, and to test the model on that sample before finalizing it. This helps us learn more generalized relationships. To know more about cross-validation, refer to the article “Improve model performance using cross-validation”.

What are the common methods used for cross-validation?

The validation set approach

In this approach, we reserve 50% of the dataset for validation and use the remaining 50% for model training. A major disadvantage is that the model is trained on only 50% of the dataset, so it may miss interesting information in the data, leading to higher bias.
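As a minimal sketch of this approach (the function name and the 50/50 fraction are illustrative, not from a specific library), we can shuffle the data and cut it in half:

```python
import random

def validation_set_split(data, train_frac=0.5, seed=42):
    """Shuffle the data, reserve `train_frac` of it for training,
    and hold out the remainder for validation."""
    rng = random.Random(seed)  # fixed seed for a reproducible split
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

data = list(range(10))
train, valid = validation_set_split(data)
# train and valid each hold 5 points, and no point appears in both
```

The model would then be fit on `train` only and scored on `valid`, which is exactly why half of the information never reaches the model.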

Leave one out cross validation (LOOCV)

In this approach, we reserve only one data point from the available data set and train the model on the rest. This process iterates over every data point. Because each test set contains a single point, the estimate of model effectiveness has high variance: it is strongly influenced by that one point, and if the point turns out to be an outlier, the variation grows even larger.
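The LOOCV splitting scheme can be sketched in a few lines (this is an illustrative generator, not a library API): each point takes one turn as the lone test case.

```python
def loocv_splits(data):
    """Yield (train, test) pairs in which each data point serves
    exactly once as the single held-out test case."""
    for i in range(len(data)):
        test = [data[i]]                  # the one reserved point
        train = data[:i] + data[i + 1:]   # everything else
        yield train, test

data = [10, 20, 30, 40]
splits = list(loocv_splits(data))
# 4 splits; the first is ([20, 30, 40], [10])
```

Note that a dataset of n points produces n train/test iterations, which is also why LOOCV becomes expensive for large datasets.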

K-fold cross validation

In this method, we follow these steps:

* Randomly split the entire dataset into k “folds”.
* For each fold, build the model on the other k − 1 folds of the dataset.
* Test the model on the held-out kth fold and record the error you see on each of the predictions.
* Repeat this until each of the k folds has served as the test set.

The average of your k recorded errors is called the cross-validation error and will serve as your performance metric for the model.
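The steps above, including the averaging of the k recorded errors, can be sketched as follows. The `fit` and `error` callables here are hypothetical stand-ins for a real model and metric; in practice you would shuffle before partitioning, as the first bullet says.

```python
def kfold_indices(n, k):
    """Partition indices 0..n-1 into k near-equal contiguous folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validation_error(data, k, fit, error):
    """Train on k-1 folds, evaluate on the held-out fold, repeat for
    every fold, and average the k recorded errors."""
    errors = []
    for fold in kfold_indices(len(data), k):
        test = [data[i] for i in fold]
        train = [data[i] for i in range(len(data)) if i not in fold]
        model = fit(train)
        errors.append(error(model, test))
    return sum(errors) / k  # the cross-validation error

# toy "model": predict the training mean; error: mean absolute error
fit = lambda train: sum(train) / len(train)
error = lambda m, test: sum(abs(x - m) for x in test) / len(test)
cv_err = cross_validation_error([1.0, 2.0, 3.0, 4.0], k=2, fit=fit, error=error)
```

Because every point is tested exactly once and trained on k − 1 times, the averaged error is a less noisy estimate than a single validation split.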

How do you choose the right value of k for K-fold cross validation?
