1. Data transformation using .groupby().transform
Welcome back! In this final chapter, we will focus on the .groupby() family of pandas functions. They will help us group the entries of a DataFrame according to the values of a specific feature.
2. The restaurant dataset
To refresh your memory, we will review when and how to use the .groupby() function.
The dataset we use in this chapter is a collection of people having dinner at a restaurant. For each person, we have various characteristics, including the total amount paid, the tip left to the waiter, the day of the week and the time of the day.
The .groupby() method is applied to a DataFrame and groups it according to a feature. Then, we can apply some simple or more complicated functions on that grouped object.
The simplest method to call is the .count() method. At first, we group the restaurant data according to whether the customer was a smoker or not. Then, we apply the .count() method. We obtain the count of smokers and non-smokers.
It is no surprise that we get the same results for all the features, as the .count() method counts the number of non-missing entries in each feature, separately for each group. As there are no missing values in our data, the results should be the same in all columns.
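As a minimal sketch of this step, we can group a small table by smoking status and count the entries. The DataFrame below is a hypothetical, hand-made stand-in for the restaurant data, and its column names (total_bill, tip, smoker, time) are assumptions rather than the actual dataset.

```python
import pandas as pd

# Hypothetical miniature stand-in for the restaurant data (column names assumed)
restaurant = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59, 25.29],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61, 4.71],
    "smoker": ["No", "No", "Yes", "No", "Yes", "Yes"],
    "time": ["Dinner", "Lunch", "Dinner", "Dinner", "Lunch", "Lunch"],
})

# Group by smoking status, then count the non-missing entries per column
counts = restaurant.groupby("smoker").count()
print(counts)
```

Because this toy table has no missing values, every column reports the same count within a group.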
3. Data transformation
After grouping the entries of the DataFrame according to the values of a specific feature, we can apply any kind of transformation we are interested in.
Here, we are going to apply the z-score, a normalization transformation: the distance between each value and the mean, divided by the standard deviation. This is a very useful transformation in statistics, often used with the z-test in standardized testing.
To apply this transformation to the grouped object, we just need to call the .transform() method containing the lambda transformation we defined. This time, we will group according to the type of meal: was it a dinner or a lunch?
As the z-score transformation is a group-related transformation, the resulting table has the same shape as the original table. For each element, we subtract the mean and divide by the standard deviation of the group it belongs to.
We can also see that numerical transformations are applied only to numerical features of the DataFrame.
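The transformation described above can be sketched as follows. The data is again a hypothetical stand-in with assumed column names. We select the numerical columns explicitly before calling .transform(), since recent pandas versions may raise an error instead of silently skipping non-numerical columns.

```python
import pandas as pd

# Hypothetical miniature stand-in for the restaurant data (column names assumed)
restaurant = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59, 25.29],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61, 4.71],
    "smoker": ["No", "No", "Yes", "No", "Yes", "Yes"],
    "time": ["Dinner", "Lunch", "Dinner", "Dinner", "Lunch", "Lunch"],
})

# z-score: distance from the group mean, divided by the group standard deviation
zscore = lambda x: (x - x.mean()) / x.std()

# Group by meal time and transform only the numerical columns
restaurant_z = restaurant.groupby("time")[["total_bill", "tip"]].transform(zscore)
print(restaurant_z)
```

The result has one row per original row, so it can be assigned back to the original DataFrame or compared with it directly.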
4. Comparison with native methods
While the transform() method simplifies things a lot, is it actually more efficient than using native Python code?
As we did before, we first group our data, this time according to sex. Then we apply the same z-score transformation as before and measure its efficiency. We omit the code for measuring the time of each operation here, as you are already familiar with this.
We can see that with the use of the transform() function, we achieve a massive speed improvement. On top of that, we're only using one line to perform the operation of interest.
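One way such a comparison could be set up is sketched below. The loop-based version is only an assumed example of "native" code, not necessarily the one used in the course, and the data is randomly generated with assumed column names (sex, total_bill). Timings depend on the machine and the data size, so none are quoted here.

```python
import time

import numpy as np
import pandas as pd

# Hypothetical random data standing in for the restaurant dataset
rng = np.random.default_rng(0)
n = 2_000
restaurant = pd.DataFrame({
    "sex": rng.choice(["Female", "Male"], size=n),
    "total_bill": rng.uniform(5, 50, size=n),
})

zscore = lambda x: (x - x.mean()) / x.std()

# Version 1: a single call to .transform() on the grouped column
start = time.perf_counter()
z_transform = restaurant.groupby("sex")["total_bill"].transform(zscore)
t_transform = time.perf_counter() - start

# Version 2 (assumed "native" approach): look up group statistics row by row
start = time.perf_counter()
stats = restaurant.groupby("sex")["total_bill"].agg(["mean", "std"])
values = []
for sex, bill in zip(restaurant["sex"], restaurant["total_bill"]):
    values.append((bill - stats.loc[sex, "mean"]) / stats.loc[sex, "std"])
z_loop = pd.Series(values, index=restaurant.index)
t_loop = time.perf_counter() - start

print(f"transform: {t_transform:.4f} s, loop: {t_loop:.4f} s")
```

Both versions compute the same values; only the way the work is expressed differs.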
5. Let's practice!
I hope you're convinced about the importance of the transform() function in terms of both code cleanliness and efficiency. Try it yourself!