Dataset shift
1. Dataset shift
Welcome back! Sometimes, models perform less well in production than in development. This is usually attributed to overfitting, but there is another possibility: your production data might differ in subtle ways from your training data, rendering the patterns you extracted obsolete, a phenomenon known as "dataset shift". Let's see how we can use scikit-learn to overcome it.
2. What is dataset shift?
A common example of dataset shift is temporal change. Consider the following dataset on electricity prices in New South Wales, where the task is to predict whether the price will go up or down from a number of features. A structural change in the market occurred midway through the period of interest. Would the performance of a classifier built on earlier data be affected? Let's investigate.
3. What is shifting exactly?
Consider a scatterplot of two variables from this data, with class 1 examples drawn in yellow and class 0 in purple. We also display the decision boundary of a naive Bayes classifier trained on this data. The data shown here are from a time period before the suspected shift occurred.
4. What is shifting exactly?
On the right, you can see the analogous plot for recent data. The decision boundary has indeed shifted significantly, so a classifier trained on old data is far less relevant than one trained on fresh data. In some parts of the literature, the true decision boundary is known as the "concept" we are attempting to learn, so that dataset shift is also known as "concept drift".
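To reproduce plots like these, here is a minimal sketch, assuming NumPy arrays X_old, y_old (early period) and X_new, y_new (recent period) holding the two plotted features and the class labels; it fits a naive Bayes classifier to each period and shades the resulting decision regions.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.naive_bayes import GaussianNB

# Assumed inputs: X_old, y_old (early period) and X_new, y_new
# (recent period), with X holding the two plotted features.
def plot_boundary(X, y, ax, title):
    clf = GaussianNB().fit(X, y)
    # Evaluate the fitted classifier on a grid spanning the feature space
    xx, yy = np.meshgrid(
        np.linspace(X[:, 0].min(), X[:, 0].max(), 200),
        np.linspace(X[:, 1].min(), X[:, 1].max(), 200),
    )
    zz = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    ax.contourf(xx, yy, zz, alpha=0.3)      # shaded decision regions
    ax.scatter(X[:, 0], X[:, 1], c=y, s=5)  # class 0 vs class 1 points
    ax.set_title(title)

fig, (ax_old, ax_new) = plt.subplots(1, 2, figsize=(10, 4))
plot_boundary(X_old, y_old, ax_old, "Before the shift")
plot_boundary(X_new, y_new, ax_new, "After the shift")
plt.show()
```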
5. Windows
An easy way to protect your model against dataset shift is to regularly retrain it using only a batch of recent data, a technique known as windowing. This is easy to implement in pandas using location indexing: look back a fixed number of data points from the current position in the stream; that number is known as the "window size". This is in contrast to training on the full data available so far, also known as an expanding window.
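As a rough sketch, assuming a time-ordered DataFrame df of the electricity data, the two windowing schemes differ only in how far back the positional slice reaches:

```python
import pandas as pd

# Assumed setup: a time-ordered DataFrame `df` of the electricity data
t = 40_000            # current position in the stream
window_size = 20_000

# Sliding window: only the last `window_size` rows before t
sliding = df.iloc[t - window_size:t]

# Expanding window: everything available up to t
expanding = df.iloc[:t]
```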
6. Dataset shift detection
Comparing the performance of sliding and expanding windows in a champion-challenger setup makes a great dataset shift detector: expanding windows contain more data than sliding windows, so the only reason for the latter to win is dataset shift. Consider what happens at time 40,000 of the electricity dataset. First, train a classifier on the entire dataset up to that time. Then, train a classifier on the last 20,000 data points only. We want to compare the performance of these two classifiers on future data; in other words, we will use all the data from time 40,000 onwards as test data. A quick comparison in terms of AUC suggests that the sliding window outperforms the expanding one, confirming dataset shift.
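Here is a minimal sketch of that comparison, assuming a time-ordered feature matrix X and binary label y from the electricity data; the choice of naive Bayes is illustrative, not prescribed by the lesson.

```python
from sklearn.metrics import roc_auc_score
from sklearn.naive_bayes import GaussianNB

# Assumed setup: time-ordered feature matrix X and binary label y,
# e.g. X, y = df.drop(columns="label"), df["label"]

# Champion: expanding window over everything seen up to time 40,000
champion = GaussianNB().fit(X.iloc[:40_000], y.iloc[:40_000])

# Challenger: sliding window of the last 20,000 points only
challenger = GaussianNB().fit(X.iloc[20_000:40_000], y.iloc[20_000:40_000])

# Both are scored on the future: everything from time 40,000 onwards
X_test, y_test = X.iloc[40_000:], y.iloc[40_000:]
auc_exp = roc_auc_score(y_test, champion.predict_proba(X_test)[:, 1])
auc_sld = roc_auc_score(y_test, challenger.predict_proba(X_test)[:, 1])

# The challenger sees less data, so it can only win through dataset shift
print(f"expanding AUC: {auc_exp:.3f}  sliding AUC: {auc_sld:.3f}")
```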
7. Window size
The window size is a critical parameter: windows that are too small carry too little information, but windows that are too large might contain stale data that confuses the classifier. It is good practice to fit a classifier on windows of different sizes and pick the size that performs best on a test dataset. In the arrhythmia data case, as you can see from the plot, a window of size 50 seems best.
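That search might look like the following sketch, reusing the X, y, X_test, and y_test conventions from above; the candidate sizes are illustrative, not taken from the lesson.

```python
from sklearn.metrics import roc_auc_score
from sklearn.naive_bayes import GaussianNB

# Assumed setup: X, y, X_test, y_test as in the previous sketch
t = 40_000                # current position in the stream
scores = {}
for w in [25, 50, 100, 200]:   # candidate window sizes (illustrative)
    clf = GaussianNB().fit(X.iloc[t - w:t], y.iloc[t - w:t])
    scores[w] = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])

# Keep the window size with the best test AUC
best_size = max(scores, key=scores.get)
print(f"best window size: {best_size}")
```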
8. Domain shift
Temporal change is not the only driver of dataset shift. Another possibility is that the data source subtly changes between development and production, a phenomenon referred to as domain shift. Take the example of an app trying to diagnose arrhythmia from ECG data. Perhaps the model was built on an open-source dataset of elderly patients and then struggles with the younger subjects it encounters in production.
9. More data is not always better!
Time for you to practice tuning the window size yourself.