
Validation usage

1. Validation usage

In the previous lesson, we learned about basic cross-validation strategies. Now, we'll consider one more strategy and also explore the general validation process.

2. Data leakage

To start with, let's introduce a new term: 'data leakage'. Leakage causes a model to seem accurate until we start making predictions in a real-world environment, where we realize that the model is actually of low quality and practically useless. There are different types of data leakage. The first one is a leak in the features. It means that we're using data that will not be available in the production setting. For example, predicting sales in US dollars while having exactly the same sales, expressed in UK pounds, as a feature. Another one is a leak in the validation strategy. It occurs when the validation strategy does not replicate the real-world situation. We will see an example on the next slide.

3. Time data

Suppose we're solving a problem with time series data. As a validation strategy, we selected the usual K-fold. The fold distribution for K equal to four is presented on the slide. What leakage can we observe here? What's wrong with a simple K-fold strategy? The problem is that in the second split we would build a model using data from the future! Obviously, in the real-world setting, we will not have access to future data. Therefore, this is an example of a leak in the validation strategy.

4. Time K-fold cross-validation

Thus, we need to be more careful with time data. One possible approach is time K-fold cross-validation. The underlying idea is to provide multiple splits in such a manner that we train only on past data while always predicting the future.

5. Time K-fold cross-validation

Time K-fold cross-validation is also available in scikit-learn's model_selection module. Let's create a TimeSeriesSplit object with 5 splits. Before applying it to the data, we need to sort the train DataFrame by date. Then, as usual, we iterate through each cross-validation split, as shown in the sketch below.
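Here is a minimal sketch of this procedure. The tiny synthetic DataFrame and the 'date' and 'sales' column names are illustrative assumptions, not the course's exact dataset:

```python
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical time series data; in a competition, `train` would be the
# provided training DataFrame with a 'date' column.
train = pd.DataFrame({
    'date': pd.date_range('2020-01-01', periods=12, freq='D'),
    'sales': range(12),
})

# Create a TimeSeriesSplit object with 5 splits
time_kfold = TimeSeriesSplit(n_splits=5)

# Sort the train DataFrame by date so earlier observations come first
train = train.sort_values('date')

# Iterate through each cross-validation split:
# the train indices always precede the test indices in time
for fold, (train_index, test_index) in enumerate(time_kfold.split(train)):
    cv_train, cv_test = train.iloc[train_index], train.iloc[test_index]
    print(f"Fold {fold}: train size {len(cv_train)}, test size {len(cv_test)}")
```

Each successive fold trains on a longer prefix of the sorted data and validates on the block that immediately follows it, so the model never sees the future.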

6. Validation pipeline

OK, we've considered various cross-validation strategies. Now, let's define the general pipeline of the validation process for any cross-validation scheme. First, create an empty list where we will store the model's results. Split the train data into folds; here, the CV_STRATEGY object should be substituted with the strategy we're using. Then, for each cross-validation split, we perform the following steps: train a model on all folds except a single one, make predictions on this unseen fold, calculate the competition metric, and append it to the list of fold metrics. As a result, we have a list of K numbers representing the model's quality on each fold. A sketch of this pipeline follows below.
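A rough sketch of this pipeline, assuming random features, a RandomForestRegressor model, and mean squared error as the competition metric (all of these are placeholder choices; any CV strategy can be substituted for KFold):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Hypothetical features and target
X = np.random.rand(100, 4)
y = np.random.rand(100)

# Substitute the chosen CV_STRATEGY here (KFold, StratifiedKFold, TimeSeriesSplit, ...)
cv_strategy = KFold(n_splits=4)

# Empty list to store the model's result on each fold
fold_metrics = []

for train_index, test_index in cv_strategy.split(X):
    # Train a model on all folds except the current one
    model = RandomForestRegressor(n_estimators=10, random_state=123)
    model.fit(X[train_index], y[train_index])

    # Make predictions on the unseen fold and calculate the competition metric
    preds = model.predict(X[test_index])
    fold_metrics.append(mean_squared_error(y[test_index], preds))

# K numbers representing the model's quality on each fold
print(fold_metrics)
```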

7. Model comparison

Now we can train two different models and, for each model, get a list of K numbers. For example, we have Models A and B with mean squared errors on four folds. Our goal is to select the model with better quality. However, it's hard to draw conclusions by comparing K numbers simultaneously. So, the next step is to transform the K fold scores into a single overall validation score.

8. Overall validation score

The simplest way to obtain a single number is to take the mean over all fold scores. However, the mean is usually not a good choice, because it does not take into account the score deviation from one fold to another. We could get a very good score on a single fold while the performance on the remaining K-1 folds is poor. Let's define a more reliable overall validation score. It uses the worst-case scenario, considering the validation score one standard deviation away from the mean: we add the standard deviation if the competition metric is being minimized and subtract it if the metric is being maximized.
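As a small sketch (the function name and signature are my own, not from the course), the overall validation score could be computed like this:

```python
import numpy as np

def overall_validation_score(fold_metrics, minimize=True):
    """Worst-case overall score: one standard deviation away from the mean.

    Add the std for metrics being minimized (e.g. MSE);
    subtract it for metrics being maximized (e.g. accuracy).
    """
    mean_score = np.mean(fold_metrics)
    std_score = np.std(fold_metrics)
    return mean_score + std_score if minimize else mean_score - std_score
```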

9. Model comparison

In our example, taking the mean over all folds suggests that Model B has a lower error. However, if we calculate the overall score taking the score deviation into account, it turns out that Model A is actually a bit better.
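To illustrate the effect with hypothetical fold MSEs (not the actual numbers from the slide): a model with a slightly lower mean error can still lose once the deviation is taken into account.

```python
import numpy as np

# Hypothetical fold MSEs for two models
model_a = np.array([2.9, 3.0, 3.1, 3.0])
model_b = np.array([2.5, 2.6, 2.7, 4.0])

for name, scores in [('A', model_a), ('B', model_b)]:
    overall = scores.mean() + scores.std()   # MSE is minimized, so add the std
    print(f"Model {name}: mean = {scores.mean():.2f}, overall = {overall:.2f}")

# Model A: mean = 3.00, overall = 3.07
# Model B: mean = 2.95, overall = 3.56
```

Model B has the lower mean, but its scores vary a lot across folds, so Model A wins on the worst-case overall score.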

10. Let's practice!

All right, enough words! Let's try all these ideas in practice!