How MLOps teams successfully operate

1. How MLOps teams successfully operate

Welcome back!

2. Limited impact of machine learning (the past)

Prior to MLOps, individual data scientists often built models on their local computers, only loosely aligned with the business strategy. Firms pursued scattered pilots to get familiar with machine learning. The results of these models were presented to stakeholders but often did not make it into production. Development took considerable time and human effort because little to no automation or standardization was involved. Each application was built from scratch. Furthermore, results often could not be reproduced because neither data nor code was stored systematically. This led to a low business impact of machine learning.

3. How MLOps teams successfully operate

Let us now discuss how MLOps can overcome these hurdles. We will not focus here on technologies but on how we work to make an impact with machine learning. Many of these insights are borrowed from established DevOps practices.

4. Reminder DevOps

As we've learned, DevOps is about minimizing friction between development and operations to improve the business outcome of IT applications. If those teams are separated, developers might constantly push new features to respond to new business requirements, while operators in charge of the stability and maintainability of the application might want to reduce and delay changes. Joint responsibilities ease these tensions and reduce the number of handoffs, each of which requires communication and risks knowledge loss.

5. Common DevOps practices

What are common DevOps practices that help us to be successful with MLOps? To improve the user experience and reduce risks, DevOps aims to release frequently and reliably. The idea is to constantly add small features or code changes and to test and deploy them automatically. That allows us to spot errors early and to easily roll back these minor modifications in case something goes wrong. Working only on small pieces of code is essential for smooth processes, rapid responses, and avoiding slack time. Automating the whole release process is a vital element of DevOps, as is the use of well-established software engineering principles such as testing and versioning. It is also essential to frequently integrate and merge code changes from different developers. This is called continuous integration.
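To make the idea of automated testing with an easy rollback a bit more concrete, here is a minimal sketch in Python. It is only an illustration: the function `release`, the check `smoke_test`, and the version strings are hypothetical names chosen for this example and do not belong to any particular tool.

```python
from typing import Callable, Iterable

# Hypothetical type: a check receives a version label and reports pass/fail.
Check = Callable[[str], bool]


def release(current: str, candidate: str, checks: Iterable[Check]) -> str:
    """Return the version that should be live after the automated checks.

    If any check fails, the candidate is rejected and the current version
    stays live, which is the "roll back" case for a small change.
    """
    for check in checks:
        if not check(candidate):
            return current  # keep serving the version we already trust
    return candidate  # all checks passed: the small change goes live


def smoke_test(version: str) -> bool:
    """Toy check (assumption): versions tagged '-broken' fail."""
    return not version.endswith("-broken")


print(release("v1.0", "v1.1", [smoke_test]))
print(release("v1.0", "v1.1-broken", [smoke_test]))
```

Because each change is small, a failed check costs little: the previous version simply stays in place while the change is fixed.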

6. Handling failures

We implement constant and immediate feedback loops, for example, again through automated tests, to ensure we identify mistakes instantly and not weeks later, when they may impair the whole system. Still, things can and will go wrong. In this case, we identify the root cause and conduct a post-mortem to do better in the future, without punishing the person responsible. A blameless culture is critical to ensure that mistakes are not hidden and that people feel free to try out risky but innovative and impactful practices. It is also vital to encourage people to share what they learned from a mistake so that the team can improve and avoid repeating it.

7. People and teams

As we discussed earlier, successful MLOps is about collaboration between people with different skills. Even within the established DevOps domain, companies design teams and roles differently. Google, for example, famously invented the widely adopted role of the site reliability engineer, who, at Google, should not spend more than 50 percent of their time on operations. Netflix is known for its "Operate what you build" culture and uses full-cycle developers responsible for all relevant steps from design to development, operations, and support. While the implementations differ between firms, autonomy, joint incentives, feedback loops, and trust are common to all of them. People should be innovative and, accordingly, rewarded for taking risks. Teams should stay together long-term to foster trust, collaboration, learning, and improvement.

8. Experimentation

The ultimate focus is on the business outcome. That requires striving for the highest quality. We achieve that through many of the elements discussed before, but an additional vital dimension is experimentation. We perform experiments, such as A/B testing, in which we roll out a new model version to only some users and compare its live performance against that of the old one. We may even inject planned, intentional failures to test the system's robustness.
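As a rough illustration of the A/B testing idea, the following Python sketch assigns each user to the old or the new model and then compares success rates per variant. The names `assign_variant`, `success_rates`, and `treatment_share`, as well as the hash-based assignment, are assumptions made for this example; a real experiment would also need a proper statistical test before drawing conclusions.

```python
import hashlib
from typing import Dict, Iterable, Tuple


def assign_variant(user_id: str, treatment_share: float = 0.1) -> str:
    """Deterministically route a user to the 'new' or the 'old' model.

    Hashing the user id gives a stable bucket in [0, 1), so the same user
    always sees the same model version during the experiment.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # 8 hex digits -> [0, 1)
    return "new" if bucket < treatment_share else "old"


def success_rates(outcomes: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Compute the share of successful outcomes for each variant."""
    counts: Dict[str, Tuple[int, int]] = {}
    for variant, success in outcomes:
        total, wins = counts.get(variant, (0, 0))
        counts[variant] = (total + 1, wins + int(success))
    return {variant: wins / total for variant, (total, wins) in counts.items()}


# Hypothetical logged outcomes: (variant shown, did the user convert?)
logged = [("new", True), ("new", False), ("old", True), ("old", True)]
print(success_rates(logged))
```

Only a small share of users sees the new model at first, so a poorly performing version affects few people and can be withdrawn quickly.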

9. Let's practice!

This was quite a lot. Therefore, let's practice what we learned!
