1. Monitoring a data pipeline
Great work! We've learned the basics of developing a data pipeline. Now, it's time to explore techniques to monitor pipeline performance, and alert on failure.
2. Monitoring a data pipeline
Once a data pipeline is developed, it should be monitored for changes to data, and failures during execution. Sometimes, source systems fail to provide data, or data types change. Other times, the tools that Data Engineers rely on become deprecated, or their functionality changes. Whatever the reason, monitoring a data pipeline ensures the solution is transparent, and proper alerting notifies Data Engineers of an issue before data consumers discover it themselves.
3. Logging data pipeline performance
In this course, we'll use logging to alert engineers of data pipeline performance.
Logs are messages created and written during the execution of a pipeline. They are configured by the pipeline's developer, and document the performance of the pipeline. Logs provide a starting point when solutions fail by letting Data Engineers revisit the execution of the pipeline.
Logs are the foundation for all monitoring and alerting efforts, and are essential for creating transparent data pipelines. The logging module in Python makes it easy to configure and create your own logs.
There are six levels of logging provided by the logging module. We'll explore four: debug, info, warning, and error. Each has an associated function and is used to reflect events of differing severity.
Debug logs are typically used when building a data pipeline, and give a Data Engineer insight into things such as data dimensionality, type, and variable values.
The info function is used to provide basic information and checkpoints throughout the execution of a pipeline, such as notifying an engineer about operations that occur on the data.
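As a minimal sketch, debug and info logs might be created like this (the configuration, DataFrame, and messages are illustrative, not taken from the course exercises):

    import logging

    import pandas as pd

    # Basic configuration: emit DEBUG and above, showing the level name in each message
    logging.basicConfig(level=logging.DEBUG, format="%(levelname)s: %(message)s")

    # A small, made-up DataFrame standing in for extracted pipeline data
    raw_data = pd.DataFrame({"ticker": ["ABC", "XYZ"], "price": [10.5, 42.0]})

    # Debug: low-level detail useful while building the pipeline
    logging.debug(f"Extracted DataFrame has shape {raw_data.shape} and dtypes {dict(raw_data.dtypes)}")

    # Info: a checkpoint documenting an operation on the data
    logging.info("Successfully extracted raw data; starting transformation.")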
4. Logging warnings and errors
In addition to debug and info-level logs, warnings and errors should also be captured using logging.
Warnings are logged when something unexpected happens, but an exception has not necessarily occurred. A use case for a warning log could be an unexpected number of rows, or previously unseen data types.
Error logs are used when an exception occurs that should halt the execution of a pipeline, such as when data has changed format, or is totally unavailable.
Properly created logs can save Data Engineers time when trying to discover why a pipeline failed, or why results have changed.
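As a sketch, warning and error logs could be written as follows (the row counts and the availability check are assumed values, used only for illustration):

    import logging

    logging.basicConfig(format="%(levelname)s: %(message)s")

    expected_rows, actual_rows = 1000, 873   # assumed counts, for illustration only

    # Warning: something unexpected happened, but no exception occurred
    if actual_rows != expected_rows:
        logging.warning(f"Expected {expected_rows} rows but found {actual_rows}.")

    # Error: a condition that should halt the pipeline
    data_available = False                   # pretend the source system returned nothing
    if not data_available:
        logging.error("Source data is unavailable; halting the pipeline.")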
5. Handling exceptions with try-except
When building a data pipeline, it's common for errors to occur. The best data pipelines handle common exceptions, and create logs to help debug.
One of the most basic ways to handle these exceptions is Python's built-in try-except logic. Code in the "try" block is executed, and if an error occurs, rather than ending execution, the code in the except block runs instead.
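A minimal sketch, assuming the data is read from a hypothetical CSV file:

    import pandas as pd

    try:
        # Attempt to extract data from a (hypothetical) source file
        raw_stock_data = pd.read_csv("raw_stock_data.csv")
    except FileNotFoundError:
        # Runs instead of halting execution if the file is missing
        print("Source file not found; falling back to an empty DataFrame.")
        raw_stock_data = pd.DataFrame()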
6. Handling specific exceptions with try-except
If known, the specific error that may arise should be passed after the except keyword. The exception can be aliased using "as", and the alias referenced later when logging.
Here, the transform function attempts to filter the raw_stock_data DataFrame by the column price_change. If present, the column is filtered to only keep rows with a positive price_change value. If a KeyError exception occurs, the error is logged, and a price_change column is created. Then, the updated DataFrame is passed again to the transform function.
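A possible shape for that transform function is sketched below; how the missing price_change column is created (here, from assumed open and close columns) is illustrative, not the course's exact implementation:

    import logging

    import pandas as pd

    logging.basicConfig(format="%(levelname)s: %(message)s")

    def transform(raw_stock_data):
        try:
            # Keep only rows with a positive price_change value
            return raw_stock_data[raw_stock_data["price_change"] > 0]
        except KeyError as ke:
            # Log the error, create the missing column, and pass the DataFrame back to transform
            logging.error(f"Column not found in DataFrame: {ke}")
            raw_stock_data["price_change"] = raw_stock_data["close"] - raw_stock_data["open"]
            return transform(raw_stock_data)

    # Example usage with a DataFrame that lacks the price_change column
    raw_stock_data = pd.DataFrame({"open": [10.0, 20.0], "close": [12.0, 18.0]})
    print(transform(raw_stock_data))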
Using logging within try-except clauses creates the most fundamental monitoring and alerting solution available in Python, by automatically handling and documenting failures that may arise. From here, more advanced tools can be built into a workflow to provide easier access to execution details.
7. Let's practice!
Great work! It's time to test your monitoring and alerting skills with some practice!