
Pipeline Monitoring

1. Pipeline Monitoring

Welcome back! In this video, we'll dive deeper into monitoring data pipelines by analyzing potential risks and developing mitigation plans.

2. Workflow

Setting up observability and monitoring for a data or ML pipeline starts with three key steps: mapping the pipeline's components, analyzing potential risks for each component, and developing a mitigation plan based on those risks.

3. Risk categories

When analyzing risks, we can group them into two categories. Well-defined risks have clear definitions and are easy to detect. For example, electricity demand values should never be negative - that's physically impossible. Ambiguous risks are harder to identify and are context-dependent. A sudden spike in electricity demand might reflect a real-world event like extreme weather, or it could signal a data issue. Since the root cause isn't always obvious, we need flexible strategies to monitor and handle these cases.

4. Mitigation plan

A solid mitigation plan should outline which logs and metrics to capture, which KPIs to monitor, and what defines success or failure for each component.

5. The pipeline components

Let's return to the pipeline architecture from the previous chapter and design its observability and monitoring setup.

6. The pipeline components

First, map out the main components. The API serves as the pipeline's external data source.

7. The pipeline components

The "check API" task detects whether new data is available.

8. The pipeline components

The data refresh task fetches new data from the API.

9. The pipeline components

The data append task adds the refreshed data to the main table.

10. The pipeline components

The forecast refresh step updates model predictions.

11. The pipeline components

Finally, the forecast scoring task evaluates model performance. Once components are mapped, we assess potential risks and define mitigation plans for each.

12. The pipeline components

Consider the API - it's critical and external, meaning we don't control how data is curated or maintained.

13. API potential risks

This introduces several risks. Data integrity issues include missing or inconsistent values. Availability issues encompass API outages or upstream delays. Data restatements occur when past observations are corrected after the fact, due to collection errors, product returns, business logic changes, or upstream delays. With risks identified, we define our mitigation plan.

14. Data integrity issues

We address data integrity issues using the pointblank library to validate incoming data. This includes checks on the data schema, value ranges for specific fields, missing values, and duplicate rows. New data is appended only if it passes all validation tests.
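As an illustration, here is a minimal validation sketch assuming the Python pointblank package and hypothetical column names (timestamp and demand); the exact checks in the course pipeline may differ.

```python
import pandas as pd
import pointblank as pb

# Hypothetical batch of hourly demand data pulled from the API
new_data = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01", periods=3, freq="h"),
    "demand": [1250.0, 1310.5, 1298.2],
})

validation = (
    pb.Validate(data=new_data, tbl_name="new_demand_data")
    .col_exists(columns=["timestamp", "demand"])          # required columns present
    .col_vals_not_null(columns=["timestamp", "demand"])   # no missing values
    .col_vals_ge(columns="demand", value=0)               # demand can never be negative
    .rows_distinct()                                       # no duplicate rows
    .interrogate()
)

if validation.all_passed():
    pass  # append new_data to the main table
```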

15. Restatements issues

To mitigate data restatement issues, we extend the data refresh window - pulling data for the past 336 hours, or two weeks, and overwriting previously stored observations with the most recent values. This captures any delayed or corrected updates from the upstream source.
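As a rough illustration, here is one way the overwrite could be implemented with pandas, assuming a hypothetical main_table and a refreshed frame covering the 336-hour window, both with a timestamp column; this is a sketch, not the course's exact implementation.

```python
import pandas as pd

REFRESH_WINDOW_HOURS = 336  # two weeks of hourly observations

def apply_refresh(main_table: pd.DataFrame, refreshed: pd.DataFrame) -> pd.DataFrame:
    """Overwrite stored observations that fall inside the refresh window."""
    window_start = refreshed["timestamp"].min()
    # Keep only rows recorded before the refresh window...
    kept = main_table[main_table["timestamp"] < window_start]
    # ...and replace everything else with the freshly pulled values
    return (
        pd.concat([kept, refreshed], ignore_index=True)
        .sort_values("timestamp")
        .reset_index(drop=True)
    )
```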

16. Availability issues

For data availability issues, we log each pipeline run. Those logs capture

17. Availability issues

execution time,

18. Availability issues

whether new data was detected, whether the refresh completed successfully,

19. Availability issues

or if no new data was available from the API. Based on this information, we define alerting logic to detect availability problems.
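To make this concrete, here is a minimal sketch of a per-run log record carrying those fields; check_api and refresh_data are hypothetical callables standing in for the pipeline's own tasks.

```python
import json
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("pipeline")

def run_and_log(check_api, refresh_data):
    """Run one refresh cycle and log execution time, data detection, and refresh outcome."""
    start = time.time()
    new_data_available = check_api()    # did the API expose new observations?
    refresh_success = refresh_data() if new_data_available else False
    logger.info(json.dumps({
        "execution_time_sec": round(time.time() - start, 2),
        "new_data_available": new_data_available,  # False means no new data from the API
        "refresh_success": refresh_success,
    }))
```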

20. Alerts

The final step defines the pipeline's success criteria and configures alerts for failure scenarios. We can use the EmailOperator to send notifications or implement custom alerting functions tailored to our specific needs, as sketched below.
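For illustration, here is a minimal sketch of such an alert task, assuming a recent Airflow 2.x release, a hypothetical DAG id, and a placeholder email address; the DAG details would match the actual pipeline.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.email import EmailOperator

with DAG(
    dag_id="demand_pipeline",            # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    # Alert task that fires only when an upstream task fails
    alert_email = EmailOperator(
        task_id="send_failure_alert",
        to="data-team@example.com",      # placeholder address
        subject="Electricity demand pipeline failure",
        html_content="The data refresh did not complete successfully. Check the run logs.",
        trigger_rule="one_failed",
    )
```

In the next video, we'll focus on model monitoring, reviewing model drift.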

21. Let's practice!

Time to practice!
