
Data drift

1. Data drift

Welcome back. Previously, we learned the importance of monitoring our deployed model to ensure reliability and performance stability. In this video, we will take this a step further and look at an insidious danger that afflicts many models in production: data drift.

2. The need for data drift detection

Data drift is a phenomenon where the statistical properties of a model's input features change over time. This can happen for many reasons, such as changes in the underlying population or shifts in the data collection process. Have a look at this graph showing the rates of heart disease at various ages a few decades ago compared to today: there are fewer instances of heart disease, and they occur at older ages. A life insurance model that was trained on the old dataset would likely become more and more inaccurate over time due to improvements in healthcare. This is data drift. Data drift does not necessarily mean that model performance will decrease; regardless, it is important that our model is trained on relevant, recent distributions of data so that it accurately reflects the current situation or environment.

3. The Kolmogorov-Smirnov test

The first step in addressing data drift is detecting it, using the monitoring processes discussed in the previous video. Several statistical tests can be used for this purpose. In this example, we will use the Kolmogorov-Smirnov (KS) test, which is often used for detecting data drift. The test compares the distributions of two datasets or columns and highlights any significant differences between them that might indicate drift.

4. Using the ks_2samp() function

We'll use SciPy's ks_2samp() function to perform a KS test for data drift. Applying the Kolmogorov-Smirnov test via ks_2samp(), we obtain a test statistic and a p-value. The test statistic indicates the magnitude of the difference between the distributions, while the p-value gauges the likelihood of observing such a difference if both samples came from the same distribution. If the p-value is less than 0.05, we suspect data drift, as this indicates the samples likely come from different distributions; otherwise, no significant drift is inferred.
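Here is a minimal sketch of such a check. The synthetic arrays stand in for a reference column (for example, from training data) and the same column in current production data; the names, sample sizes, and distributions are illustrative assumptions.

import numpy as np
from scipy.stats import ks_2samp

# Illustrative stand-ins: a reference column (e.g., from training data)
# and the same column as seen in current production data
rng = np.random.default_rng(42)
reference = rng.normal(loc=50, scale=10, size=1000)
current = rng.normal(loc=55, scale=10, size=1000)

# Compare the two samples with the Kolmogorov-Smirnov test
result = ks_2samp(reference, current)
print(f"KS statistic: {result.statistic:.3f}, p-value: {result.pvalue:.4f}")

if result.pvalue < 0.05:
    print("Possible data drift: samples likely come from different distributions.")
else:
    print("No significant drift detected.")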

5. Correcting data drift

Once we have detected data drift, we should correct it. This involves updating our model to account for the new statistical properties of the data. There are several ways to do this, including retraining the model on the new data. Often, however, things are not that simple, and we will not have access to enough data from the new distribution to build a robust model. In this case, it is possible to compromise: we can periodically retrain the model on a mixed dataset of old and new data, increasing the amount of new data until we have enough, as sketched below.
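A hedged sketch of this mixing strategy, assuming pandas DataFrames for the old and new data; the helper name and the increasing new-data fraction are illustrative choices, not a prescribed API.

import pandas as pd

def build_mixed_training_set(old_df, new_df, new_fraction):
    """Combine old and new data so new rows make up roughly `new_fraction`
    of the result; `new_fraction` should be between 0 (exclusive) and 1."""
    # Downsample the old data to hit the target mix
    n_old = int(len(new_df) * (1 - new_fraction) / new_fraction)
    old_sample = old_df.sample(n=min(n_old, len(old_df)), random_state=0)
    return pd.concat([old_sample, new_df], ignore_index=True)

# On each retraining cycle, raise the fraction as more post-drift
# data accumulates, e.g. 0.3 -> 0.5 -> 0.8:
# mixed = build_mixed_training_set(old_data, new_data, new_fraction=0.3)
# model.fit(mixed[features], mixed[target])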

6. Further resources for detecting and rectifying data drift

The KS test is primarily used for comparing a specific column across two datasets, such as a newer version of an outdated dataset. The Population Stability Index (PSI) is another effective method for assessing drift in individual categorical variables or specific columns between datasets; a sketch of a PSI calculation follows below. There are also libraries designed specifically for detecting and rectifying data drift. One example is Evidently, an open-source Python library made for data scientists and ML engineers; Evidently helps test, evaluate, and track model performance from validation to production. Another example is NannyML, which helps monitor model performance post-deployment.
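As a sketch of the PSI idea (the binning scheme, the small epsilon, and the rule-of-thumb thresholds below are common conventions, not part of any single library's API):

import numpy as np

def population_stability_index(expected, actual, bins=10):
    """PSI between a reference (expected) column and a current (actual) one."""
    # Derive bin edges from the reference data; in this sketch, current
    # values falling outside the reference range are simply ignored
    edges = np.histogram_bin_edges(expected, bins=bins)
    expected_counts, _ = np.histogram(expected, bins=edges)
    actual_counts, _ = np.histogram(actual, bins=edges)

    # Convert counts to proportions, clipping to avoid log(0)
    eps = 1e-6
    expected_pct = np.clip(expected_counts / expected_counts.sum(), eps, None)
    actual_pct = np.clip(actual_counts / actual_counts.sum(), eps, None)

    # PSI = sum over bins of (actual% - expected%) * ln(actual% / expected%)
    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))

# Common rule of thumb: PSI below 0.1 suggests little shift, 0.1 to 0.25
# a moderate shift, and above 0.25 a significant shift worth investigating.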

7. Let's practice!

In summary, there are many ways to detect and correct data drift. Now, we will try a hands-on data drift exercise to ensure our models stay accurate and effective over time.
