Profiling, versioning, and feature stores
In this lesson we will talk about data profiling, versioning, and feature stores. We will explain what they are and how they make our lives easier. Let's begin.
Within MLOps, data profiling refers to automated data analysis and the creation of high-level summaries, called data profiles or expectations, which we use to validate and monitor data in production.
Data profiles allow us to give feedback to the user when they provide invalid inputs, and to decide when to retrain our model.
Without them, for one, we risk being blamed by clients for unexpected predictions when in fact they submitted erroneous inputs to the model; and two, we simply have no way to detect that the data has drifted and our model needs to be retrained.
To summarize, our model training pipeline should include a data profiling step, and we should store data profiles together with other model metadata in the metadata store.
One of the best-known tools for data profiling is great_expectations, part of the Python open-source ecosystem.
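To make the idea concrete, here is a minimal hand-rolled sketch of profiling and validation. It does not use the great_expectations API (which varies across versions); it only illustrates the pattern a profiling tool automates: summarize the training data into an expectation, then check new inputs against it. The column name `age` and the min/max rule are illustrative assumptions.

```python
# Illustrative data-profiling sketch (hand-rolled, not great_expectations):
# build a profile from training data, then validate new inputs against it.

def build_profile(records, column):
    """Summarize a numeric column into a simple 'expectation'."""
    values = [r[column] for r in records]
    return {"column": column, "min": min(values), "max": max(values)}

def validate(record, profile):
    """Check a new record against the stored profile."""
    value = record[profile["column"]]
    return profile["min"] <= value <= profile["max"]

train = [{"age": 23}, {"age": 41}, {"age": 67}]
profile = build_profile(train, "age")

print(validate({"age": 35}, profile))  # within the training range
print(validate({"age": -5}, profile))  # out of range: reject or flag
```

In production, a failed validation would trigger user feedback ("your input looks wrong") or a monitoring alert, rather than silently producing a prediction.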
Our training pipeline should also record the exact version of the input data we used to train our model. We call this data versioning, and we need it above all to ensure reproducibility.
This doesn't mean we should store a copy of the full training dataset within each model deployment package. We usually leave the data in some centralized location and record only a pointer to it together with the model metadata. We should also make sure to save some kind of dataset fingerprint that allows us to verify that not a single record in the dataset has changed in the meantime. If it has, our model is no longer reproducible.
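A dataset fingerprint can be as simple as a cryptographic hash of the data file. The sketch below uses SHA-256 from Python's standard library; the function name and chunked-reading approach are illustrative choices, not a prescribed standard.

```python
import hashlib

def dataset_fingerprint(path, chunk_size=1 << 20):
    """SHA-256 hash of a dataset file; changes if any byte changes."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large datasets don't need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The returned hex string is stored with the model metadata; recomputing it later over the centrally stored dataset verifies that no record has changed.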
For bonus MLOps points, our training pipeline can also record metadata that allows us to reconstruct the exact train-test split used for performance estimation.
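One way to make a split reconstructible is to derive it from a recorded seed, so the same metadata always yields the same indices. This is a minimal sketch using Python's standard library; the function name and metadata fields are illustrative assumptions.

```python
import random

def split_indices(n_rows, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx); the same seed yields the same split."""
    rng = random.Random(seed)       # local RNG: no global-state side effects
    idx = list(range(n_rows))
    rng.shuffle(idx)
    cut = int(n_rows * (1 - test_fraction))
    return idx[:cut], idx[cut:]

# Recording this small dict with the model metadata is enough to
# rebuild the exact same split later.
split_metadata = {"n_rows": 1000, "test_fraction": 0.2, "seed": 42}
train_idx, test_idx = split_indices(**split_metadata)
```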
A very popular tool for data versioning is DVC, which stands simply for Data Version Control.
Lastly, let's talk about feature stores, a newer concept that is gaining more and more acceptance in the world of MLOps engineering.
Essentially, a feature store is a central database that stores data specifically prepared for ML training and inference, which enables us to reuse features across models and projects. Feature stores are often implemented as so-called "dual databases": one store is highly optimized for retrieving large volumes of data for training, and the other for fast retrieval of individual records at prediction time.
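The dual-database idea can be sketched in a few lines. This toy class is purely illustrative (real feature stores such as Feast back the two sides with separate storage systems): an append-only offline store serves bulk training reads, while a keyed online store serves single-record lookups at prediction time.

```python
# Toy "dual database" feature store: offline store for bulk training
# reads, online store for low-latency per-entity lookups.
class FeatureStore:
    def __init__(self):
        self.offline = []   # append-only history, scanned in bulk for training
        self.online = {}    # latest features per entity, keyed for fast lookup

    def ingest(self, entity_id, features):
        self.offline.append({"entity_id": entity_id, **features})
        self.online[entity_id] = features   # overwrite with latest values

    def training_data(self):
        return list(self.offline)           # bulk read for model training

    def serve(self, entity_id):
        return self.online[entity_id]       # single-record read at inference

store = FeatureStore()
store.ingest("user_1", {"n_orders": 3})
store.ingest("user_1", {"n_orders": 4})
print(store.serve("user_1"))        # latest features only
print(len(store.training_data()))   # full history of 2 rows
```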
Apart from the obvious benefit of reducing the time needed for feature engineering, a feature store also greatly helps avoid so-called training-serving skew: when our model performs significantly worse in production than it did at training time. This will, for example, happen if we clean the input data during training but forget to also do it in production.
Imagine training a spam filter on clean plain-text emails, unaware that in production your model will mostly receive HTML emails. This is a widespread mistake in practice, and it can take a long time to identify, during which your performance suffers. Luckily, MLOps practices are there to help mitigate such risks.
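A common mitigation for this kind of skew is to define the cleaning step once and reuse the same function in both the training pipeline and the prediction service. This is a simplified sketch of the spam-filter example; the regex-based HTML stripping is an illustrative stand-in for real preprocessing.

```python
import re

# Define cleaning ONCE, and call the same function at training time
# and at serving time, so both paths see identically prepared text.
def clean_email(raw):
    """Strip HTML tags and normalize whitespace (simplified)."""
    text = re.sub(r"<[^>]+>", " ", raw)
    return " ".join(text.split()).lower()

# Training time: raw emails are cleaned before feature extraction.
train_texts = [clean_email(e) for e in ["Win a <b>prize</b> now!"]]

# Serving time: an incoming HTML email goes through the same path,
# so the model sees the same representation it was trained on.
assert clean_email("Win a <b>prize</b>  now!") == train_texts[0]
```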
Well, we covered some exciting concepts here. Let's practice and make them stick!