
Testing Polars Pipelines

1. Testing Polars Pipelines

The Chicago analytics team can now write and run larger Polars pipelines. The last step is making those pipelines safe to change.

2. Why test pipelines?

When a pipeline is small, the team can inspect the output by eye, like the incorrect value on the bottom-right here.

3. Why test pipelines?

But that doesn't work as the size of their pipeline grows. Let's look at how to write tests to check that pipelines return what they expect.

4. Completed requests by department

The team starts with a query they've used before: counting completed requests by department. The first step is to wrap the query in a function called completed_by_department. It takes a LazyFrame called requests as input and returns a transformed LazyFrame.

5. Completed requests by department

Inside the function, they start from the requests LazyFrame and build the lazy query.

6. Completed requests by department

They group by department and count the number of requests in each group. This is the core of the transformation they want to test.

7. Completed requests by department

They also cast the len column to Int32 to reduce memory pressure on the pipeline. Notice that they've forgotten the filter for completed requests. Let's see if the tests catch that.

8. Test input

The first step in testing is to create a small sample DataFrame. The sample has just enough rows and columns to test the filter, the group by, and the sort. On the second row, it has a non-completed Status that should be filtered out by the query.

9. Expected output

Then the team writes the expected output in a DataFrame. The open Water request should be filtered out, so only the Aviation and Sanitation departments remain. There's another issue here, though. In the function, the team cast len to Int32, but the integer column created here will have the default Int64 dtype. Again, let's let the tests reveal that.

10. Getting the actual result

Now they get the actual function output by running completed_by_department on the small test input.

11. Getting the actual result

They convert the sample DataFrame to a LazyFrame and collect the result.

12. Getting the actual result

The actual result has an Int32 len column, as intended. But it also includes Water, which should have been filtered out in the function as a non-completed request. How will the team become aware of these issues?

13. Comparing with equals

The simplest way to check if one DataFrame is the same as another is with the equals method. Here it returns False. This tells the team that something is wrong, but they can't tell what the problem is.

14. Importing Polars testing

For stronger tests, Polars provides a testing submodule. The two main helpers are assert_frame_equal and assert_schema_equal. These raise an AssertionError when something is wrong.

15. Testing the schema

First, the team tests whether the actual output has the same schema as the expected output. They call assert_schema_equal.

16. Testing the schema

And pass the actual and expected schemas. This fails and tells them that the dtypes don't match because the query cast len to Int32, but the expected DataFrame has len as Int64. The team sees that there is an issue with how they created the expected DataFrame.

17. Fixing the expected schema

They fix the expected DataFrame by ensuring len has an Int32 dtype.

18. Testing the schema

Running the schema assertion again now passes with no output. Now they want to test the values as well as the schemas.

19. Testing the frame

For this, they use assert_frame_equal.

20. Testing the frame

And they pass the actual and expected DataFrames. Again, they get an AssertionError. The output tells them that the actual output has three rows, whereas the expected output has two.

21. Comparing actual and expected

They quickly spot the extra Water row and realize they forgot the completed-request filter.

22. Fixing the query

They fix the query by adding the missing filter. Then they re-create the actual output.

23. Final assertions

Now the checks all pass. The team sees how the tests protect against both kinds of mistake: dtype drift and incorrect rows.

24. Let's practice!

Now let's practice testing Polars pipelines.
