
Profiling, versioning, and feature stores

1. Profiling, versioning, and feature stores

In this lesson we will talk about data profiling, versioning, and feature stores. We will explain what they are and how they make our life easier. Let's begin.

2. Data profiling

Within MLOps, data profiling refers to the automated data analysis and creation of high-level summaries, called data profiles, or expectations, which we use for validating and monitoring data in production.

3. Purpose of profiles 1

Data profiles allow us to

4. Purpose of profiles 2

give feedback to the user if they are providing wrong inputs.

5. Purpose of profiles 3

make decisions on when to retrain our model.

6. Risk of NOT using data profiles

Without them, for one, we risk getting blamed by clients for unexpected predictions when in fact they submitted erroneous inputs to the model, and two, we have no way to detect that the data has drifted and our model needs to be retrained.

7. Checklist: Profile

To summarize, our model training pipeline should include a data profiling step, and we should store data profiles together with other model metadata in the metadata store.

8. great_expectations

One of the best-known contemporary tools for data profiling is great_expectations, part of the Python open-source ecosystem.
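The underlying idea can be sketched without any library: build a profile from the training data, then validate new inputs against it. This is a minimal illustration in plain Python, not the great_expectations API; the column name and tolerance are made up for the example.

```python
import statistics

def build_profile(rows, column):
    """Summarize one numeric column into a simple data profile."""
    values = [row[column] for row in rows]
    return {
        "column": column,
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
    }

def validate(row, profile, tolerance=0.1):
    """Check a new record against the profile; flag out-of-range inputs."""
    value = row[profile["column"]]
    span = profile["max"] - profile["min"]
    lo = profile["min"] - tolerance * span
    hi = profile["max"] + tolerance * span
    return lo <= value <= hi

training_rows = [{"age": a} for a in (25, 31, 40, 52, 38)]
profile = build_profile(training_rows, "age")
print(validate({"age": 34}, profile))   # in range -> True
print(validate({"age": 400}, profile))  # suspicious input -> False
```

At prediction time, a failed validation lets us return a helpful error to the user instead of a nonsense prediction; a rising failure rate over time is a retraining signal.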

9. Data Versioning

Now, what our training pipeline should also record, is the exact version of the input data we used to train our model.

10. We call this versioning

We call this data versioning

11. To ensure reproducibility

and we need it most of all to ensure reproducibility.

12. Not saving a copy

This doesn't mean we should store a copy of the full training dataset within each model deployment package.

13. Data stays in place

We usually leave that data in some centralized location

14. Just a pointer

and just record a pointer to it together with the model metadata. We should also make sure to save some kind of dataset fingerprint that allows us to verify that not a single record in the dataset has changed in the meantime. If it has, our model is no longer reproducible.
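A dataset fingerprint can be as simple as a cryptographic hash over all records. Here is a minimal sketch using Python's standard library; the storage URI is a hypothetical placeholder for wherever your data actually lives.

```python
import hashlib
import json

def dataset_fingerprint(records):
    """Hash every record so that changing any single one alters the result."""
    digest = hashlib.sha256()
    for record in records:
        # Canonical JSON (sorted keys) makes the hash independent of
        # dict key order but sensitive to every value.
        digest.update(json.dumps(record, sort_keys=True).encode())
    return digest.hexdigest()

data = [{"id": 1, "label": 0}, {"id": 2, "label": 1}]
metadata = {
    "data_uri": "s3://my-bucket/training-data/v1/",  # pointer, not a copy
    "fingerprint": dataset_fingerprint(data),
}

# Later: re-hash the data at the pointer and compare before reusing it.
assert metadata["fingerprint"] == dataset_fingerprint(data)
```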

15. Bonus: train/test

For bonus MLOps points, we will have our training pipeline also record metadata that allows us to reconstruct the exact train-test split used for performance estimation.
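One common way to make the split reconstructible is to record the random seed (and split fraction) in the model metadata. A sketch of the idea, using only the standard library:

```python
import random

def split(records, test_fraction=0.2, seed=42):
    """Deterministic train-test split: the same seed yields the same split."""
    rng = random.Random(seed)
    indices = list(range(len(records)))
    rng.shuffle(indices)
    cut = int(len(records) * (1 - test_fraction))
    train = [records[i] for i in indices[:cut]]
    test = [records[i] for i in indices[cut:]]
    return train, test

data = list(range(100))
train_a, test_a = split(data, seed=42)
train_b, test_b = split(data, seed=42)
assert train_a == train_b and test_a == test_b  # fully reproducible
```

Storing `seed=42` and `test_fraction=0.2` alongside the model lets anyone recreate the exact holdout set used for performance estimation.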

16. DVC

A very popular tool for data versioning is DVC, which stands simply for Data Version Control.
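A typical DVC workflow looks roughly like this: DVC stores the data in a remote of your choice and commits only small pointer files to git. The bucket path and file name below are placeholders for your own setup.

```shell
# Initialize DVC inside an existing git repository
dvc init

# Track a dataset: DVC writes a small data/train.csv.dvc pointer file
dvc add data/train.csv

# Version the pointer (not the data) in git
git add data/train.csv.dvc .gitignore
git commit -m "Track training data v1 with DVC"

# Configure remote storage and push the actual data there
dvc remote add -d storage s3://my-bucket/dvc-store
dvc push
```

Checking out an old git commit and running `dvc pull` then restores exactly the dataset that model was trained on.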

17. Feature stores

Lastly, let's talk about feature stores. This is a novel concept that is gaining more and more acceptance in the world of MLOps engineering.

18. Essentially a DB

Essentially, it's a central database that stores data specifically prepared for ML training and inference

19. Cross-project

which enables us to reuse features across models and projects.

20. Dual DB

They are often implemented as so-called "dual databases"

21. High-volume DB

where one is highly optimized for grabbing large volumes of data for training

22. Row DB

and the other for fast retrieval of individual records at prediction time.
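To make the dual-database idea concrete, here is a toy in-memory sketch (real feature stores back these with an analytical store and a key-value store, respectively; the class and field names are invented for illustration):

```python
class FeatureStore:
    """Toy dual store: offline history for training, online latest for serving."""

    def __init__(self):
        self.offline = []   # append-only history, read in bulk for training
        self.online = {}    # latest features per entity, fast point lookups

    def write(self, entity_id, features):
        self.offline.append({"entity_id": entity_id, **features})
        self.online[entity_id] = features

    def get_training_data(self):
        return list(self.offline)          # large batch for model training

    def get_online_features(self, entity_id):
        return self.online[entity_id]      # single record at prediction time

store = FeatureStore()
store.write("user_1", {"avg_spend": 12.5})
store.write("user_1", {"avg_spend": 14.0})
print(len(store.get_training_data()))        # 2 historical rows
print(store.get_online_features("user_1"))   # latest: {'avg_spend': 14.0}
```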

23. Reusability

Apart from the obvious benefit of reducing the time needed for feature engineering,

24. Train-serve skew

a feature store also greatly helps to avoid the so-called training-serving skew. Training-serving skew is when our model performs significantly worse in production than it did at training time. This will, for example, happen if we clean the input data during training but forget to also do it in production.

25. Emails

Imagine training a spam filter on clean text emails, unaware that your model will mostly receive HTML emails in production. This is a widespread mistake in practice, and it can take a long time to identify, during which your performance is suffering. Luckily, MLOps practices are there to help and mitigate such risks.
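The standard remedy is to route training and serving through one shared preprocessing function, so the two code paths cannot diverge. A minimal sketch for the email example:

```python
import html
import re

def clean_text(raw):
    """Single cleaning function shared by training and serving."""
    text = re.sub(r"<[^>]+>", " ", raw)      # strip HTML tags
    text = html.unescape(text)               # decode entities like &amp;
    return " ".join(text.lower().split())    # normalize case and whitespace

# Both the training pipeline and the prediction service call clean_text,
# so HTML emails in production are cleaned exactly like the training data.
assert clean_text("<p>Buy NOW &amp; save!</p>") == "buy now & save!"
```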

26. Let's practice!

Well, we covered some exciting concepts here. Let's practice and make them stick!
