Logs
1. Logs
Logs are critical in data and ML pipelines. They provide traceability by capturing detailed information at each step.

2. Pipeline logs
Typically, for both data ETL and ML logs, we log the following components: process metadata, such as statistics and information about the job runtime; the results of validation tests; and a rule-based evaluation of the overall process - success or failure. In the case of a model refresh, we also track the model's performance using evaluation metrics.

3. Pipeline logs
Having a robust logging system enables us to set up pipeline observability in production and monitor the pipeline's health. When a failure occurs, effective logs help diagnose the root cause faster, which is critical when your pipeline serves customers or supports a live decision-making system. Additionally, logs support rule-based decisions during the pipeline runtime. We will review pipeline observability in the next chapter.

4. Setting logs
To capture logs during the pipeline runtime, we use a simple Python class that sets the log schema and defines methods to update the log for the possible scenarios, such as no new data being available, the validation outcome (success or failure), and the job completion status. At the end of the job runtime, it appends the log to the main log table. Let's see how it works, starting with a minimal sketch below.
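To make this concrete, here is a minimal sketch of what such a logging class could look like. The class name Log and the method names create_log, no_updates, and failure follow the narration; the field names, the use of pandas, and the save method are illustrative assumptions.

```python
import pandas as pd
from datetime import datetime, timezone


class Log:
    """Minimal sketch of a pipeline runtime log (field names are assumptions)."""

    def create_log(self, series_id, parameters, series_details):
        # Set the log schema for this runtime
        self.log = {
            "series_id": series_id,
            "runtime_timestamp": datetime.now(timezone.utc),
            "parameters": parameters,        # GET request parameters
            "series_details": series_details,
            "validation": None,              # results of the validation tests
            "update": False,                 # was new data appended to the main table?
            "success": False,                # overall runtime status
            "comments": "",
        }
        return self.log

    def no_updates(self):
        # No new data is available; the runtime still finishes successfully
        self.log["success"] = True
        self.log["comments"] = "No new data is available"

    def failure(self, reason):
        # The data refresh or the validation failed
        self.log["success"] = False
        self.log["comments"] = reason

    def save(self, log_table_path):
        # Append the runtime log as a new row of the main log table
        # (assumes the log table already exists with a matching header)
        pd.DataFrame([self.log]).to_csv(
            log_table_path, mode="a", header=False, index=False
        )
```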
5. The ETL log process

Using the pipeline-dedicated ETL module, we define the log schema during the execution of the DAG's first task. The Log class provides the supporting functionality to capture logs during the pipeline runtime. We will define the series details and use the create_log method to set the schema when triggering the pipeline. The log's schema includes the following fields: the series ID, the runtime timestamp, the GET request parameters, the series details, the validation results, and an indicator flagging whether the refreshed data was appended successfully to the main table.
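As a sketch of how the first DAG task might set this schema when the pipeline is triggered (the series identifier and detail values below are hypothetical):

```python
# First DAG task: initialize the runtime log (illustrative values)
log = Log()

series_details = {
    "series_id": "my_series",   # hypothetical series identifier
    "frequency": "monthly",
    "units": "index",
}

log.create_log(
    series_id=series_details["series_id"],
    parameters=None,            # GET request parameters, filled in after the metadata check
    series_details=series_details,
)
```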
Next, we retrieve the series metadata from the API and check if new data is available by comparing the series's last timestamp locally and on the API. If no new data is available, we use the no_updates method to update the log. If new data is available, the function returns the GET request parameters, which we use to retrieve the incremental data from the API.
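A sketch of that check, assuming a helper (here called check_for_updates) that compares the two timestamps; the metadata field names and the returned parameter names are assumptions:

```python
def check_for_updates(local_last_timestamp, api_metadata, log):
    # Compare the series's last timestamp locally and on the API
    api_last_timestamp = api_metadata["last_updated"]

    if api_last_timestamp <= local_last_timestamp:
        # Nothing new on the API side; record it in the log and stop here
        log.no_updates()
        return None

    # New observations exist: return the GET request parameters
    # needed to pull only the incremental data
    return {
        "observation_start": local_last_timestamp,
        "observation_end": api_last_timestamp,
    }
```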
In the case of failure, we use the failure method to update the log accordingly. Otherwise, we start the validation process by defining a validation object and the data schema, executing the validation, and updating the log based on the validation results. If the validation fails, we use the failure method to update the log. Otherwise, we mark the log runtime as a success and append both the incremental data and the log to the actual and log tables, respectively.
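Continuing the sketch, the end-to-end task could look roughly like this; get_series_data, validate, append_to_table, schema, and the table paths are hypothetical placeholders for the course's own helpers:

```python
parameters = check_for_updates(local_last_timestamp, api_metadata, log)

if parameters is not None:
    try:
        incremental_data = get_series_data(parameters)    # pull the new observations
    except Exception as err:
        log.failure(f"GET request failed: {err}")
    else:
        validation = validate(incremental_data, schema)   # run the validation tests
        log.log["validation"] = validation.results

        if not validation.passed:
            log.failure("Validation failed")
        else:
            # Append the incremental data to the actual table and flag the update
            append_to_table(incremental_data, data_table_path)
            log.log["update"] = True
            log.log["success"] = True

# In every scenario, append the runtime log to the log table
log.save(log_table_path)
```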
Each row in the log table represents a specific runtime of the pipeline. In this case, we have three types of outcomes: data was available and the process was completed successfully, no new data was available, and the initial backfill of the data.
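For illustration, assuming the log table is stored as a CSV file, inspecting the most recent runtimes could look like this (the file path and column names follow the sketch above):

```python
import pandas as pd

# Each row of the log table corresponds to one pipeline runtime
logs = pd.read_csv("data/etl_log.csv", parse_dates=["runtime_timestamp"])

# A quick look at the latest runtimes and their outcomes
print(logs[["runtime_timestamp", "update", "success", "comments"]].tail())
```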
6. The forecast log process

Likewise, we log the forecast information and the forecast performance. The forecast log file includes the forecast metadata, the results of the validation tests, and the forecast performance metrics.
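A minimal sketch of what a forecast log entry might capture; the field names and the choice of error metrics (for example MAPE and RMSE) are assumptions rather than the course's exact schema:

```python
from datetime import datetime, timezone

forecast_log = {
    # Forecast metadata
    "series_id": "my_series",
    "runtime_timestamp": datetime.now(timezone.utc),
    "model": "baseline_model",
    "forecast_horizon": 24,
    # Results of the validation tests on the forecast output
    "validation": {"schema_check": True, "leakage_check": True},
    # Forecast performance metrics, filled in once actual values arrive
    "performance": {"mape": None, "rmse": None},
    "success": True,
}
```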
7. Time for practice!

Time for practice!