1. Practical examples
Now, we will sum up the course with some practical examples.
2. Data and pipeline
We will be working with a dataset of data science salaries. Let's assume somebody gave us a data pipeline implemented in Python and asked us to create and run tests on it before deployment. The pipeline is simple: first, we read the data, then we filter it by employment type and calculate the mean salary. Finally, we save the result to a file.
3. Code of the pipeline
The code is quite straightforward. We have the "read_df" fixture to prepare the environment, which in our case means reading the data. And we have two functions: one to filter the data and one to get the mean. What tests should we create to make sure everything works properly?
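A pipeline like the one described could be sketched as follows. This is only an illustration under assumptions: the column names "Employment_Type" and "Salary_USD", the helper "load_data", the function names "filter_df" and "get_mean", and the small inline sample are all hypothetical stand-ins, since the real file and column names are not shown here.

```python
import io

import pandas as pd
import pytest

# Hypothetical stand-in for the salaries file; real column names may differ.
SAMPLE_CSV = """Employment_Type,Salary_USD
FT,100000
FT,120000
PT,40000
CT,60000
"""


def load_data(source):
    """Read the salaries data from a path or a file-like object."""
    return pd.read_csv(source)


@pytest.fixture
def read_df():
    # Prepare the environment for a test: read the data.
    return load_data(io.StringIO(SAMPLE_CSV))


def filter_df(df, employment_type="FT"):
    """Keep only the rows with the given employment type."""
    return df[df["Employment_Type"] == employment_type]


def get_mean(df):
    """Return the mean salary of the given rows."""
    return df["Salary_USD"].mean()
```

Keeping the reading step in a plain helper means the same logic can be used both by the fixture and by the pipeline itself.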
4. Integration tests
Let's start with integration tests. We know that the pipeline reads and writes the data to the file system. Hence, we have to make sure that the integration between Python and the file system works. There will be two test cases: reading the data and writing to the file.
In this first snippet, we are validating that Python can read the DataFrame correctly by checking its type and shape.
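A reading test of this kind might look like the sketch below. To keep it self-contained, it first writes a tiny sample file into a temporary directory; the file name "salaries.csv", the column names, and the expected shape are assumptions made for this example, not the real dataset.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample with 4 rows and 2 columns.
SAMPLE_CSV = """Employment_Type,Salary_USD
FT,100000
FT,120000
PT,40000
CT,60000
"""


def test_read_df():
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "salaries.csv")
        # Put the sample on disk so the read goes through the file system.
        with open(path, "w") as f:
            f.write(SAMPLE_CSV)
        df = pd.read_csv(path)
    # The object is a DataFrame of the expected size.
    assert isinstance(df, pd.DataFrame)
    assert df.shape == (4, 2)
```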
5. Integration tests
To check that Python can create files, we are going to create a temporary text file named "temp.txt". When we open "temp.txt" in the "w" mode, Python creates an empty file. Then, we write something into the file and make sure it exists. Finally, we remove it to keep the file system clean.
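Those steps can be sketched with the standard library alone. The test name and the text written to the file are arbitrary choices for this example, and the "try/finally" is an addition that removes the file even if the assertion fails.

```python
import os


def test_write_file():
    file_name = "temp.txt"
    try:
        # Opening in "w" mode creates the file if it does not exist.
        with open(file_name, "w") as f:
            f.write("Hello")
        assert os.path.exists(file_name)
    finally:
        # Remove the file to keep the file system clean.
        if os.path.exists(file_name):
            os.remove(file_name)
```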
6. Unit tests
Now, let's move on to the unit tests. Here, we want to ensure that the smallest pieces work as expected. If we filter the dataset properly, it will contain only the "FT" employment type. That is our first test case. The second one is to check that the mean is a float. Our code contains one function with the two unit tests. First, we get the filtered dataset. Then, we check the unique values of its employment type column using the pandas "unique" method. Finally, we check that the mean has a float type by using the Python built-in "isinstance" function.
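Here is a minimal sketch of such a test function. The sample data, the column names, and the names "make_df", "filter_df", "get_mean", and "test_units" are hypothetical; the real pipeline may name things differently.

```python
import pandas as pd
import pytest


def make_df():
    # Hypothetical sample standing in for the salaries data.
    return pd.DataFrame({
        "Employment_Type": ["FT", "FT", "PT", "CT"],
        "Salary_USD": [100000, 120000, 40000, 60000],
    })


@pytest.fixture
def read_df():
    return make_df()


def filter_df(df, employment_type="FT"):
    return df[df["Employment_Type"] == employment_type]


def get_mean(df):
    return df["Salary_USD"].mean()


def test_units(read_df):
    filtered = filter_df(read_df)
    # After filtering, "FT" is the only employment type left.
    assert list(filtered["Employment_Type"].unique()) == ["FT"]
    # The mean is a float.
    assert isinstance(get_mean(filtered), float)
```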
7. Feature tests
Feature tests help ensure that the users of our software get what they want. In our case, the feature is getting the final result: the mean salary. We know it has to be greater than zero and no greater than the maximum salary. Otherwise, it would not make any sense. This part of the code is another test function. It contains the data filtering and two assertions about the mean value. We use the pandas "max" method to get the maximum salary in the initial dataset.
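A feature test along these lines could be sketched as shown below. As before, the sample data, column names, and function names are assumptions made only so the example can run on its own.

```python
import pandas as pd
import pytest


def make_df():
    # Hypothetical sample standing in for the salaries data.
    return pd.DataFrame({
        "Employment_Type": ["FT", "FT", "PT", "CT"],
        "Salary_USD": [100000, 120000, 40000, 60000],
    })


@pytest.fixture
def read_df():
    return make_df()


def filter_df(df, employment_type="FT"):
    return df[df["Employment_Type"] == employment_type]


def get_mean(df):
    return df["Salary_USD"].mean()


def test_feature(read_df):
    filtered = filter_df(read_df)
    mean = get_mean(filtered)
    # The mean salary must be positive...
    assert mean > 0
    # ...and cannot exceed the maximum salary in the initial dataset.
    assert mean <= read_df["Salary_USD"].max()
```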
8. Performance tests
Finally, we have to make sure that the pipeline finishes within a reasonable amount of time. We will use the pytest-benchmark package for that. The benchmark decorator makes it possible to create the measured function on the fly. We named that function "get_result", and it contains two steps: filtering the data and getting the mean. pytest-benchmark will measure everything inside that function.
9. Final test suite
That is what the final test suite looks like as a whole. The code does not contain the pipeline itself, only the tests. In real-life projects, a test suite is often much bigger. We used the integration tests to check the connection with the file system. We implemented the unit tests to validate the smallest parts of the pipeline. We also created the feature tests to make sure the users would get what they wanted. And finally, the performance tests will help us check the speed of the pipeline.
10. Let's practice!
Now, make your own test suites!