
MLOps best practices and pitfalls

1. MLOps best practices and pitfalls

Hello again! Now that we know how to measure our progress toward MLOps, let's discuss how we can design the process so it runs as smoothly as possible.

2. Content of this video

We will touch upon best practices to succeed with MLOps and pitfalls to avoid. Since collaboration and culture are important elements of MLOps, best practices are essential to streamlining machine learning development and operations. We will relate these best practices to the steps of the MLOps life cycle we discussed earlier.

3. Best practices around designing an MLOps application

Let's start with the design phase, and with data first. Better data often improves model or application performance more than a better model or additional input features does. Therefore, the primary best practice is to get the best possible input data, rather than tune the best model on mediocre data later. This is sometimes called a data-centric approach. As we have repeatedly mentioned, we should always start with a business question and evaluate whether machine learning is necessary and what is feasible to achieve with ML. This can mean we want to do a proof of concept, or PoC, first, particularly if we are unsure whether we can translate the problem into something that a machine learning algorithm can understand and solve. Early on, we should identify where we expect bottlenecks, and optimally, we will learn something important even if a PoC fails.

4. Infrastructure

A challenge for MLOps applications regarding infrastructure is fluctuating demand for compute power. During training and peak operation phases, we may need to scale up the infrastructure rapidly, and it needs to be prepared to do so. We should also have a reasonable estimate of how expensive this architectural flexibility is.
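As a back-of-the-envelope illustration of estimating that cost, here is a minimal sketch. All instance counts, hours, and hourly rates are invented placeholders, not real cloud prices:

```python
# Hypothetical sketch: rough cost of bursting compute for training.
# All numbers below are invented placeholders, not real cloud rates.

def burst_cost(instances, hours, hourly_rate):
    """Cost of running `instances` machines for `hours` at `hourly_rate`."""
    return instances * hours * hourly_rate

# Steady-state serving vs. a short, wide training burst.
steady = burst_cost(instances=2, hours=720, hourly_rate=0.50)   # one month
burst = burst_cost(instances=10, hours=48, hourly_rate=0.50)    # training run
print(f"monthly steady: ${steady:.2f}, training burst: ${burst:.2f}")
```

Even a crude estimate like this helps decide whether on-demand scaling is affordable or whether the training schedule needs to change.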

5. Best practices concerning modeling

Next, we move to the development phase. It usually makes sense to begin model development with a simple model, for several reasons: it can be developed quickly and serves as a baseline for measuring how much we improve. Furthermore, we often understand its behavior better than that of more complex models. Importantly, when adding new models or model variations, we should always check how much better our model is than the status quo. A more complex model should add significant value, in terms of both technical metrics and business outcomes.
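A minimal sketch of such a baseline, assuming a classification task with toy data (all names and values here are illustrative, not from the video): a majority-class predictor that any candidate model must beat.

```python
# Illustrative baseline: always predict the most frequent training label.
# Any candidate model should clearly outperform this on held-out data.

def majority_baseline(train_labels):
    """Return a predictor that ignores features and emits the majority label."""
    majority = max(set(train_labels), key=train_labels.count)
    return lambda _features: majority

def accuracy(predict, features, labels):
    """Fraction of examples the predictor gets right."""
    correct = sum(predict(x) == y for x, y in zip(features, labels))
    return correct / len(labels)

# Toy data; the baseline never looks at the features.
train_labels = [1, 1, 1, 0]
test_features = [[0.2], [0.9], [0.5]]
test_labels = [1, 0, 1]

baseline = majority_baseline(train_labels)
baseline_acc = accuracy(baseline, test_features, test_labels)
print(f"baseline accuracy: {baseline_acc:.2f}")  # predicts 1 everywhere: 2 of 3
```

Reporting every new model's metric next to this number makes the value added by extra complexity explicit.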

6. Testing

We have already mentioned this when discussing the MLOps life cycle: there should be a culture of automated testing. We want to test code, models, data quality across the different stages of our application, and the complete pipeline. This is critical because individual team members often work only on specific parts of the pipeline, yet each step can introduce errors that are hard to spot.
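Data-quality checks at a single pipeline stage can be as simple as the sketch below. The column names and value bounds are assumptions for the example; in practice they would come from the stage's data contract.

```python
# Illustrative data-quality checks for one pipeline stage.
# Column names and bounds are made-up examples, not a real schema.

def check_schema(rows, required_columns):
    """Every row must contain all required columns."""
    return all(required_columns <= row.keys() for row in rows)

def check_ranges(rows, column, low, high):
    """All values in `column` must fall inside [low, high]."""
    return all(low <= row[column] <= high for row in rows)

rows = [
    {"age": 34, "income": 52_000},
    {"age": 29, "income": 48_500},
]

schema_ok = check_schema(rows, {"age", "income"})
ranges_ok = check_ranges(rows, "age", 0, 120)
print(schema_ok, ranges_ok)
```

Wired into a test runner and executed on every pipeline run, checks like these catch upstream errors before they silently degrade the model.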

7. Code quality and reproducibility

During development and operations, we need to ensure that we can understand and reproduce, even years later, what we did at a certain point and why. For that, testing should also cover code quality. That means our code should not only run without errors but also be easy to understand, maintainable, and adaptable.

8. Knowledge transfer

To reproduce earlier results, we need to version our code and models and maintain a strong culture of documentation, well-written code, and knowledge transfer. Data versioning is trickier, but we should at least store the datasets used for experimentation or re-training.
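One lightweight way to tie an experiment back to its exact data is to fingerprint the dataset file, for example with a SHA-256 hash logged next to the code and model versions. This is a minimal sketch, not a full data-versioning tool:

```python
# Minimal sketch of dataset fingerprinting for reproducibility:
# hash the exact bytes used in a run and log the digest with the model.
import hashlib
import os
import tempfile

def dataset_fingerprint(path, chunk_size=8192):
    """Return the SHA-256 hex digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Demo with a throwaway file standing in for a training dataset.
with tempfile.NamedTemporaryFile(delete=False, suffix=".csv") as tmp:
    tmp.write(b"age,income\n34,52000\n")
    path = tmp.name

fingerprint = dataset_fingerprint(path)
os.remove(path)
print(fingerprint[:12])  # store alongside code and model versions
```

Dedicated tools such as DVC build on the same idea; the point is that every result can be traced to the exact bytes it was trained on.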

9. Document insights gained

In machine learning, we often perform analyses that do not make it into the final code. Preserving the insights gained from these experiments is also good practice.

10. Best practices concerning operations

When it comes to the actual operations of our application, we need to establish good and automated logging and monitoring practices. Our team should be trained to react to downtimes and, more generally, to re-train models, preferably automatically. This is sometimes called Continuous Monitoring and Continuous Training, extending the CI/CD-DevOps practices we learned about earlier.
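The monitoring-to-retraining loop can be sketched as follows. The window size and accuracy threshold here are invented illustration values, not recommendations:

```python
# Hedged sketch of continuous monitoring: track accuracy over a rolling
# window and flag re-training when it drops below a threshold.
# Window size and threshold are made-up values for illustration.
from collections import deque

class AccuracyMonitor:
    def __init__(self, window=100, threshold=0.8):
        self.window = deque(maxlen=window)  # keeps only the latest results
        self.threshold = threshold

    def record(self, prediction, actual):
        """Log whether one live prediction was correct."""
        self.window.append(prediction == actual)

    def needs_retraining(self):
        """True once rolling accuracy falls below the threshold."""
        if not self.window:
            return False
        return sum(self.window) / len(self.window) < self.threshold

monitor = AccuracyMonitor(window=5, threshold=0.8)
for pred, actual in [(1, 1), (1, 1), (0, 1), (1, 0), (0, 1)]:
    monitor.record(pred, actual)

print(monitor.needs_retraining())  # 2 of 5 correct: below threshold
```

In a real system, the flag would trigger an automated re-training pipeline rather than a print statement; that automation is what turns monitoring into Continuous Training.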

11. Let's practice!

Let's practice best practices!