Unit-testing a data pipeline

1. Unit-testing a data pipeline

In addition to checkpoint validation and end-to-end testing, unit tests are a great way to validate the functionality of a data pipeline.

2. Validating a data pipeline with unit tests

Unit tests are code that tests the functionality of other code, and they are commonly used in software engineering workflows. Unit tests are the foundation of code validation and can be used by data engineers to ensure that components of a data pipeline work as expected. Unit tests can also be written to validate the data produced by a pipeline. In a typical data pipeline workflow, unit tests are written and run before end-to-end validation is completed, to validate both the code and the resulting data.

3. pytest for unit testing

To build and run unit tests with Python, we'll be using a library called pytest. With pytest, unit tests are written as functions. Typically, these function names start with "test", which allows pytest to automatically parse and run tests within a project. In this example, we define a function test_transformed_data. This function asserts that the clean_stock_data object indeed takes type pd-dot-DataFrame. When the command python dash-m pytest is executed, this test will be parsed and run. If no AssertionErrors are raised, a success message will be output. Let's take a closer look at the isinstance function, as well as the assert keyword.
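As a minimal sketch of such a test (the transform_stock_data helper here is a hypothetical stand-in for real pipeline code):

    import pandas as pd

    # Hypothetical stand-in for the pipeline's transform step
    def transform_stock_data():
        return pd.DataFrame({"open": [100.0], "close": [101.0]})

    def test_transformed_data():
        clean_stock_data = transform_stock_data()

        # The test passes as long as no AssertionError is raised
        assert isinstance(clean_stock_data, pd.DataFrame)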

4. assert and isinstance

To check the object's type, we'll use the isinstance function. isinstance takes two arguments: an object and a data type. If the object matches the data type, the function returns True. Otherwise, isinstance will return False. Here, "ETL" is assigned to the pipeline_type variable, and isinstance returns True when called, since pipeline_type takes the type string. The assert keyword validates that a boolean expression is indeed True, and raises an AssertionError otherwise. Here, we validate that pipeline_type indeed takes the value "ETL". Since the statement evaluates to True, no error is raised. When writing unit tests, we'll use assert and isinstance together to validate the type that objects take.
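In code, the examples above look like this:

    pipeline_type = "ETL"

    # isinstance returns True, since pipeline_type takes the type string
    isinstance(pipeline_type, str)    # True

    # No AssertionError is raised, since the expression evaluates to True
    assert pipeline_type == "ETL"

    # Combining assert and isinstance to validate an object's type
    assert isinstance(pipeline_type, str)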

5. AssertionError

In this example, "ETL" is again assigned to the pipeline_type variable. This time, we attempt to assert that this object is a float. Since this is False, an AssertionError is raised, as shown here. If this statement were placed within a unit test, the test would fail when run.
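The failing assertion looks like this:

    pipeline_type = "ETL"

    # pipeline_type is a string, not a float, so an AssertionError is raised
    assert isinstance(pipeline_type, float)

    # Traceback (most recent call last):
    #   ...
    # AssertionError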

6. Mocking data pipeline components with fixtures

pytest fixtures are functions that allow test data and objects to be shared across multiple tests. They can be used to simplify test setup, and provide a common set of test data for multiple tests. In this example, we create a fixture called clean_data, which returns a cleaned DataFrame. The fixture is then passed to the test_transformed_data function. When run, this unit test will be able to access the cleaned DataFrame created and returned by the clean_data fixture. We'll explore fixtures more in the following exercises.
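A minimal sketch of this pattern (the DataFrame contents are placeholder values; in practice, the fixture would return real cleaned pipeline data):

    import pandas as pd
    import pytest

    @pytest.fixture()
    def clean_data():
        # Return a cleaned DataFrame to be shared across tests
        return pd.DataFrame({"open": [100.0, 101.5],
                             "close": [101.0, 102.5]})

    def test_transformed_data(clean_data):
        # pytest passes the fixture's return value in as the argument
        assert isinstance(clean_data, pd.DataFrame)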

7. Unit testing DataFrames

In addition to testing functions, we can also test data. In this example, we'll test the clean_data DataFrame passed into the test as a fixture. Using the dot-columns attribute, we assert that there are four columns in this DataFrame. We can use other built-in pandas methods, such as dot-min, to assert that all values in the open column are greater than zero. This can be taken one step further by validating the max value of the column with the dot-max method. Running unit tests against data helps to confirm that data follows business rules and requirements, and can help to catch data quality issues before a pipeline is shipped to production.
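Building on a fixture like the clean_data one above, these checks might be sketched as follows (the four-column count, the open column name, and the upper bound are assumptions for illustration):

    def test_transformed_data(clean_data):
        # The cleaned DataFrame should contain exactly four columns
        assert len(clean_data.columns) == 4

        # All values in the "open" column should be greater than zero
        assert clean_data["open"].min() > 0

        # Hypothetical business rule: no opening price above 1000
        assert clean_data["open"].max() <= 1000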

8. Let's practice!

Time to explore unit testing further with some hands-on practice!