Testing Polars Pipelines
1. Testing Polars Pipelines
The Chicago analytics team can now run and write larger Polars pipelines. The last step is making those pipelines safe to change.
2. Why test pipelines?
When a pipeline is small, the team can inspect the output by eye, like the incorrect value on the bottom-right here.
3. Why test pipelines?
But that doesn't work as the pipeline grows. Let's look at how to write tests that check that pipelines return what the team expects.
4. Completed requests by department
The team starts with a query they've used before: counting completed requests by department. The first step is to wrap the query in a function called completed_by_department. It takes a LazyFrame called requests as input and returns a transformed LazyFrame.
5. Completed requests by department
Inside the function, they start from the requests LazyFrame and build the lazy query.
6. Completed requests by department
They group by department and count the number of requests in each group. This is the core of the transformation they want to test.
7. Completed requests by department
They also cast the len column to Int32 to reduce memory pressure on the pipeline. Notice that they've forgotten the filter for completed requests. Let's see if the tests catch that.
8. Test input
The first step in testing is to create a small sample DataFrame. The sample has just enough rows and columns to test the filter, the group by, and the sort. On the second row, it has a non-completed Status that should be filtered out by the query.
9. Expected output
Then the team writes the expected output in a DataFrame. The open Water request should be filtered out, so only the Aviation and Sanitation departments remain. There's another issue here, though. In the function, the team cast len to Int32, but the integer column created here will have the default Int64 dtype. Again, let's let the tests reveal that.
10. Getting the actual result
Now they get the actual function output by running completed_by_department on the small test input.
11. Getting the actual result
They convert the sample DataFrame to a LazyFrame and collect the result.
12. Getting the actual result
The actual result has an Int32 count column, as intended. But it also includes Water, which should have been filtered out in the function as a non-completed request. How will the team become aware of these issues?
13. Comparing with equals
The simplest way to check if one DataFrame is the same as another is with the equals method. Here it returns False. This tells the team that something is wrong, but they can't tell what the problem is.
14. Importing Polars testing
For stronger tests, Polars provides a testing submodule. The two main helpers are assert_frame_equal and assert_schema_equal. These raise an AssertionError when something is wrong.
15. Testing the schema
First, the team tests whether the actual output has the same schema as the expected output. They call assert_schema_equal
16. Testing the schema
And pass the actual and expected schemas. This fails and tells them that the dtypes don't match because the query cast len to Int32, but the expected DataFrame has len as Int64. The team sees that there is an issue with how they created the expected DataFrame.
17. Fixing the expected schema
They fix the expected DataFrame by ensuring len has an Int32 dtype.
18. Testing the schema
Running the schema assertion again now passes with no output. Now they want to test the values as well as the schemas.
19. Testing the frame
For this, they use assert_frame_equal.
20. Testing the frame
And they pass the actual and expected DataFrames. Again, they get an AssertionError. The output tells them that the actual output has three rows, whereas the expected output has two.
21. Comparing actual and expected
They quickly spot the extra Water row and realize they forgot the completed-request filter.
22. Fixing the query
They fix the query by adding the missing filter. Then they re-create the actual output.
23. Final assertions
Now the checks all pass. The team sees how the tests protect against both kinds of mistake: dtype drift and incorrect rows.
24. Let's practice!
Now let's practice testing Polars pipelines.