
Logs

1. Logs

Logs are critical in data and ML pipelines. They provide traceability by capturing detailed information at each step.

2. Pipeline logs

Typically, for both data ETL and ML pipelines, we log the following components: process metadata, such as statistics and information about the job runtime; the results of the validation tests; and a rule-based evaluation of the overall process - success or failure. In the case of a model refresh, we also track the model's performance using evaluation metrics.
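As a rough illustration of those components, here is a minimal sketch of a single run's log entry; field names such as run_metadata and eval_metrics are hypothetical, not the course's exact schema:

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class RunLog:
    """Illustrative container for the components a pipeline run typically logs."""
    run_metadata: dict                    # statistics and information about the job runtime
    validation_passed: bool               # result of the validation tests
    status: str                           # rule-based evaluation of the run: "success" or "failure"
    eval_metrics: Optional[dict] = None   # model performance metrics (model-refresh runs only)


# Hypothetical example of one run's entry
entry = asdict(RunLog(
    run_metadata={"job": "etl_refresh", "runtime_sec": 42.7},
    validation_passed=True,
    status="success",
))
```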

3. Pipeline logs

Having a robust logging system enables us to set up pipeline observability in production and monitor the pipeline's health. When a failure occurs, effective logs help diagnose the root cause faster, which is critical when your pipeline serves customers or supports a live decision-making system. Additionally, logs support rule-based decisions during the pipeline runtime. We will review pipeline observability in the next chapter.

4. Setting logs

To capture logs during the pipeline runtime, we use a simple Python class that sets the log schema and defines methods to update the logs based on the possible scenarios, such as no new data being available, the validation outcome being a success or failure, and the job completion status. At the end of the job runtime, we append the log to the main log table. Let's see how it works.
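A minimal sketch of such a class is shown below. The create_log, no_updates, and failure methods are named later in this chapter; the success and append_log names and the CSV-backed log table are assumptions made for illustration:

```python
import pandas as pd


class Log:
    """Sketch of a log helper: one instance captures one pipeline runtime."""

    def create_log(self, **fields):
        # Set the log schema at the start of the run (the fields are detailed below)
        self.log = {"status": None, "appended": False, **fields}

    def no_updates(self):
        # Scenario: no new data is available on the API
        self.log["status"] = "no new data"

    def failure(self, reason=None):
        # Scenario: the request or the validation tests failed
        self.log.update({"status": "failure", "validation": reason})

    def success(self):
        # Scenario: validation passed and the refresh was appended to the main table
        self.log.update({"status": "success", "appended": True})

    def append_log(self, log_table_path):
        # At the end of the runtime, append this run's log to the main log table
        pd.DataFrame([self.log]).to_csv(log_table_path, mode="a", header=False, index=False)
```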

5. The ETL log process

Using the pipeline-dedicated ETL module, we define the log schema during the execution of the DAG's first task. The Log class provides the supporting functionality to capture logs during the pipeline runtime. We will define the series details and use the create_log method to set the schema when triggering the pipeline. The log's schema includes the following fields, with a short sketch after the list:

6. The ETL log process

the series ID,

7. The ETL log process

the runtime timestamp,

8. The ETL log process

the GET request parameters,

9. The ETL log process

the series details,

10. The ETL log process

the validation results,

11. The ETL log process

and an indicator to flag whether the refreshed data was appended successfully to the main table.
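Collecting the fields above, the schema set by create_log might look roughly like this; the key names and placeholder values are illustrative assumptions:

```python
import datetime

etl_log = {
    "series_id": "hypothetical_series",                        # the series ID
    "run_time": datetime.datetime.now(datetime.timezone.utc),  # the runtime timestamp
    "parameters": {"start": None, "end": None},                # the GET request parameters
    "series_details": {},                                      # the series details
    "validation": None,                                        # the validation results
    "appended": False,                                         # True once the refresh is appended to the main table
}
```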

12. The ETL log process

Next, we retrieve the series metadata from the API and check if new data is available

13. The ETL log process

by comparing the series's last timestamp locally and on the API.
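A minimal sketch of that check, assuming the local table has a timestamp column and the API metadata exposes a last-updated value (the names are illustrative):

```python
import pandas as pd


def new_data_available(local_table: pd.DataFrame, api_last_timestamp: pd.Timestamp) -> bool:
    # Compare the series' most recent local timestamp with the one reported by the API
    return api_last_timestamp > local_table["timestamp"].max()
```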

14. The ETL log process

If no new data is available, we use the no_updates method to update the log.

15. The ETL log process

If new data is available, the function returns the GET request parameters,

16. The ETL log process

which we use to retrieve the incremental data from the API.
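With the requests library, for example, the incremental pull could look roughly like this; the URL and parameter names are placeholders, not the course's actual API:

```python
import requests


def get_incremental_data(api_url: str, params: dict) -> dict:
    # params typically bound the request to the window after the last local timestamp
    response = requests.get(api_url, params=params, timeout=30)
    response.raise_for_status()  # a failed request is what the failure method would log
    return response.json()
```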

17. The ETL log process

In the case of failure, we use the failure method to update the log accordingly.

18. The ETL log process

Otherwise, we start the validation process: we define a validation object and the data schema, then execute the validation,

19. The ETL log process

and update the log based on the validation results.

20. The ETL log process

If the validation fails, we use the failure method to update the log.

21. The ETL log process

Otherwise, we update the runtime status in the log as a success and append the incremental data and the log to the main data table and the log table, respectively.
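Putting those steps together, the tail of the task could look like this sketch, using pandera as one possible validation library and CSV files as the tables; the schema, column names, and helper names are assumptions:

```python
import pandas as pd
import pandera as pa

# Illustrative data schema; the columns and checks are assumptions
schema = pa.DataFrameSchema({
    "timestamp": pa.Column("datetime64[ns]"),
    "value": pa.Column(float, pa.Check.ge(0)),
})


def validate_and_append(incremental_data: pd.DataFrame, log, data_path: str, log_path: str):
    try:
        schema.validate(incremental_data)      # execute the validation
    except pa.errors.SchemaError as err:
        log.failure(reason=str(err))           # validation failed: update the log
    else:
        log.success()                          # validation passed: mark the runtime a success
        incremental_data.to_csv(data_path, mode="a", header=False, index=False)
    log.append_log(log_path)                   # append the log to the log table
```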

22. The ETL log process

Each row in the log table represents a specific runtime of the pipeline. In this case, we have three types of outcomes:

23. The ETL log process

Data was available, and the process was completed successfully.

24. The ETL log process

No new data was available.

25. The ETL log process

And the initial backfill of the data.

26. The forecast log process

Likewise, we log the forecast information

27. The forecast log process

and the forecast performance.

28. The forecast log process

The forecast log file includes:

29. The forecast log process

The forecast metadata,

30. The forecast log process

The results of the validation tests,

31. The forecast log process

And the forecast performance metrics.
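By analogy with the ETL log, the forecast log's fields might look like this sketch; the field and metric names are illustrative assumptions:

```python
import datetime

forecast_log = {
    "series_id": "hypothetical_series",                        # forecast metadata
    "run_time": datetime.datetime.now(datetime.timezone.utc),  # forecast metadata
    "model": "hypothetical_model",                             # forecast metadata
    "validation": None,                                        # results of the validation tests
    "metrics": {"mape": None, "rmse": None},                   # forecast performance metrics
}
```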

32. Time for practice!

Time for practice!
