1. Grouped summary statistics
Now that we know how to group DataFrames and how to calculate summary statistics for a single column, it's time to combine the two!
2. What we know now
Going back to the US minimum wage DataFrame, it might be interesting to look at the mean or the maximum of the minimum wage over the years.
However, we will get more meaningful information when we look at the mean for different regions. That's where grouped summary statistics come in.
3. What we know now
We could calculate this using filter and the corresponding functions, but that approach would involve a lot of repetition, which invites mistakes. A better way to arrive at the desired outcome is to use groupby, followed by combine.
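To make the repetition concrete, here is a minimal sketch of the filter-based approach. The wages DataFrame, its region labels, and its values are made up for illustration; only the column names region and effective_min_wage_2020_dollars come from the lesson.

```julia
using DataFrames
using Statistics

# Hypothetical stand-in for the minimum wage data (values are invented).
wages = DataFrame(
    region = ["West", "West", "South", "South", "South"],
    effective_min_wage_2020_dollars = [12.0, 10.0, 7.0, 8.0, 9.0],
)

# One filter-and-mean step per region: the same code, repeated by hand.
west = filter(:region => ==("West"), wages)
south = filter(:region => ==("South"), wages)

west_mean = mean(west.effective_min_wage_2020_dollars)
south_mean = mean(south.effective_min_wage_2020_dollars)
```

Every additional region needs another pair of lines, which is exactly the kind of copy-and-paste that groupby and combine avoid.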
4. Using combine() and groupby()
We can break it down as follows. We first group the DataFrame by region using groupby and then use combine to apply the mean function to the effective_min_wage_2020_dollars column.
This gives us the desired average minimum wage in the different US regions as a new DataFrame with two columns, region and, by default, effective_min_wage_2020_dollars_mean, where rows correspond to the different groups.
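The two steps can be sketched as follows. The data is a small made-up stand-in for the lesson's DataFrame, and the variable names grouped and result are chosen here for illustration.

```julia
using DataFrames
using Statistics

# Hypothetical stand-in for the minimum wage data (values are invented).
wages = DataFrame(
    region = ["West", "West", "South", "South", "South"],
    effective_min_wage_2020_dollars = [12.0, 10.0, 7.0, 8.0, 9.0],
)

# Step 1: split the rows into one group per region.
grouped = groupby(wages, :region)

# Step 2: apply mean to the wage column within each group.
# The result column is named after the source column plus the function name.
result = combine(grouped, :effective_min_wage_2020_dollars => mean)
```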
5. Combining combine() and groupby()
Of course, we don't need to save the grouped DataFrame. We can nest the groupby call directly inside combine.
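A minimal sketch of the one-step version, again on made-up data:

```julia
using DataFrames
using Statistics

# Hypothetical stand-in for the minimum wage data (values are invented).
wages = DataFrame(
    region = ["West", "West", "South", "South", "South"],
    effective_min_wage_2020_dollars = [12.0, 10.0, 7.0, 8.0, 9.0],
)

# groupby is called inline, so no intermediate grouped variable is kept.
result = combine(groupby(wages, :region), :effective_min_wage_2020_dollars => mean)
```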
6. Using combine() and groupby()
Similar to what we have seen before, we can change the default name of the resulting column by using another equals-greater sign, followed by the desired new name either as a string or a symbol.
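As a sketch, renaming the result column could look like this. The new name mean_wage is an arbitrary choice for illustration, and the data is made up.

```julia
using DataFrames
using Statistics

# Hypothetical stand-in for the minimum wage data (values are invented).
wages = DataFrame(
    region = ["West", "West", "South", "South", "South"],
    effective_min_wage_2020_dollars = [12.0, 10.0, 7.0, 8.0, 9.0],
)

# New column name given as a symbol...
by_symbol = combine(groupby(wages, :region),
                    :effective_min_wage_2020_dollars => mean => :mean_wage)

# ...or as a string; both produce a column called mean_wage.
by_string = combine(groupby(wages, :region),
                    :effective_min_wage_2020_dollars => mean => "mean_wage")
```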
7. Multiple functions on one column
Now, what if we are interested in not only the mean but also the median? Or the maximum?
We can compute all these at once, using the column's name followed by dot-equals-greater sign followed by the names of the functions in square brackets.
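A minimal sketch of several functions applied to one column, using made-up data:

```julia
using DataFrames
using Statistics

# Hypothetical stand-in for the minimum wage data (values are invented).
wages = DataFrame(
    region = ["West", "West", "South", "South", "South"],
    effective_min_wage_2020_dollars = [12.0, 10.0, 7.0, 8.0, 9.0],
)

# .=> pairs the single column with each function in the vector,
# producing one result column per function.
result = combine(groupby(wages, :region),
                 :effective_min_wage_2020_dollars .=> [mean, median, maximum])
```

The result columns get the default names, that is, the source column name followed by _mean, _median, and _maximum.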
8. Multiple functions on one column
If we want to change the default names of the columns, we would add another dot-equals-greater sign and the new names in square brackets.
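With custom names, the same call might look like the sketch below. The names mean_wage, median_wage, and max_wage are arbitrary choices for illustration, and the data is made up.

```julia
using DataFrames
using Statistics

# Hypothetical stand-in for the minimum wage data (values are invented).
wages = DataFrame(
    region = ["West", "West", "South", "South", "South"],
    effective_min_wage_2020_dollars = [12.0, 10.0, 7.0, 8.0, 9.0],
)

# The first function is matched with the first name, the second with the second, etc.
result = combine(groupby(wages, :region),
                 :effective_min_wage_2020_dollars .=>
                     [mean, median, maximum] .=>
                     [:mean_wage, :median_wage, :max_wage])
```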
Note that the dot is necessary here as we deal with vectors.
9. Multiple columns with one function
Similarly, we can use one function on multiple columns by passing the column names as a vector.
Again, the dot in the dot-equals-greater sign is necessary. Without it, both columns are passed to the function together as separate arguments, and we get an error.
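Here is a sketch of one function applied to two columns. The second column, federal_min_wage_2020_dollars, is a hypothetical name introduced only for this example, and all values are made up.

```julia
using DataFrames
using Statistics

# Hypothetical data; federal_min_wage_2020_dollars is an assumed column name.
wages = DataFrame(
    region = ["West", "West", "South", "South", "South"],
    effective_min_wage_2020_dollars = [12.0, 10.0, 7.0, 8.0, 9.0],
    federal_min_wage_2020_dollars = [8.0, 6.0, 5.0, 6.0, 7.0],
)

# .=> pairs each column in the vector with the same function.
result = combine(groupby(wages, :region),
                 [:effective_min_wage_2020_dollars,
                  :federal_min_wage_2020_dollars] .=> mean)
```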
10. Multiple columns with multiple functions
If we want to use several functions on multiple columns at once, we have two options.
We can apply all of the functions to all of the columns. In that case, we don't put commas between the functions, meaning that we are not passing a vector but rather a one-row matrix of functions.
This results in the calculation of all possible combinations of the column-function pairs.
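The first option can be sketched like this. The second column name is hypothetical and the data is made up; the point is the space-separated [mean median], which Julia reads as a one-row matrix.

```julia
using DataFrames
using Statistics

# Hypothetical data; federal_min_wage_2020_dollars is an assumed column name.
wages = DataFrame(
    region = ["West", "West", "South", "South", "South"],
    effective_min_wage_2020_dollars = [12.0, 10.0, 7.0, 8.0, 9.0],
    federal_min_wage_2020_dollars = [8.0, 6.0, 5.0, 6.0, 7.0],
)

# A column vector broadcast against a one-row matrix gives every
# column-function combination: 2 columns x 2 functions = 4 result columns.
result = combine(groupby(wages, :region),
                 [:effective_min_wage_2020_dollars,
                  :federal_min_wage_2020_dollars] .=> [mean median])
```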
The second option is to pass the functions in a vector, with commas in between them. Julia then applies the corresponding function to the corresponding column. Be careful to have the same number of columns and functions, though!
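For comparison, a sketch of the second option with the same made-up data. Here the functions are separated by commas, so the first function goes with the first column and the second with the second. The choice of mean and maximum is arbitrary.

```julia
using DataFrames
using Statistics

# Hypothetical data; federal_min_wage_2020_dollars is an assumed column name.
wages = DataFrame(
    region = ["West", "West", "South", "South", "South"],
    effective_min_wage_2020_dollars = [12.0, 10.0, 7.0, 8.0, 9.0],
    federal_min_wage_2020_dollars = [8.0, 6.0, 5.0, 6.0, 7.0],
)

# Two columns, two functions, paired element by element:
# mean of the effective wage, maximum of the federal wage.
result = combine(groupby(wages, :region),
                 [:effective_min_wage_2020_dollars,
                  :federal_min_wage_2020_dollars] .=> [mean, maximum])
```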
11. Possible functions
So, which functions can we use? There are several options.
We can use the usual statistical functions such as sum, mean, minimum, etc.
We can predefine our own functions and just call them here.
We can also define our own anonymous functions directly in the combine call and wrap them in ByRow so that they are applied row by row.
Or finally, we can use special functions available in the DataFrames package. The two most important ones are nrow and proprow. The nrow function returns the number of rows in each group, while proprow returns the proportion of rows in each group relative to the whole DataFrame.
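The options above can be sketched together as follows. The helper wage_range, the 40-hour multiplier, and the column name weekly_wage are hypothetical and exist only for this example. Note that proprow requires a reasonably recent DataFrames version.

```julia
using DataFrames
using Statistics

# Hypothetical stand-in for the minimum wage data (values are invented).
wages = DataFrame(
    region = ["West", "West", "South", "South", "South"],
    effective_min_wage_2020_dollars = [12.0, 10.0, 7.0, 8.0, 9.0],
)
grouped = groupby(wages, :region)

# A predefined custom function (hypothetical helper): spread within a group.
wage_range(x) = maximum(x) - minimum(x)
ranges = combine(grouped, :effective_min_wage_2020_dollars => wage_range)

# An anonymous function wrapped in ByRow is applied to each row separately,
# so the result keeps one row per original row (40 hours is an arbitrary choice).
weekly = combine(grouped,
                 :effective_min_wage_2020_dollars => ByRow(w -> w * 40) => :weekly_wage)

# Special DataFrames functions: rows per group and each group's share of all rows.
counts = combine(grouped, nrow, proprow)
```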
12. Cheat sheet
That was a lot of information, so here is a little cheat sheet you can refer back to.
13. Let's practice!
Are you ready to combine your knowledge? Let's practice in the exercises!