Groups: great for music, great for data
A common analysis problem is how to calculate summary statistics for each group of data. For example, you might want to know your sales revenues by month, or by region. In R, the process of splitting up your data into groups, applying a summary statistic on each group, and combining the results into a single data structure, is known as "split-apply-combine". The concept is much older though: SQL has had the GROUP BY
statement for decades. The term "map-reduce" is a similar concept, where "map" is very roughly analogous to the "split" and "apply" steps, and "reducing" is "combining". The dplyr
/sparklyr
approach is to use group_by()
before you mutate()
or summarize()
. It takes the unquoted names of columns to group by. For example, to calculate the mean of column x
, for each combination of values in columns grp1
and grp2
, you would write the following.
a_tibble %>%
group_by(grp1, grp2) %>%
summarize(mean_x = mean(x))
Note that the columns passed to group_by()
should typically be categorical variables. For example, if you wanted to calculate the average weight of people relative to their height, it doesn't make sense to group by height, since everyone's height is unique. You could, however, use cut()
to convert the heights into different categories, and calculate the mean weight for each category.
This exercise is part of the course
Introduction to Spark with sparklyr in R
Exercise instructions
A Spark connection has been created for you as spark_conn
. A tibble attached to the track metadata stored in Spark has been pre-defined as track_metadata_tbl
.
- Group the contents of
track_metadata
byartist_name
, then:- Summarize the groupwise mean of
duration
as a new column,mean_duration
. - Assign the results to
duration_by_artist
.
- Summarize the groupwise mean of
- Find the artists with the shortest songs by arranging the rows in ascending order of
mean_duration
. - Likewise, find those with the longest songs by arranging in descending order of
mean_duration
.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# track_metadata_tbl has been pre-defined
track_metadata_tbl
duration_by_artist <- track_metadata_tbl %>%
# Group by artist
___ %>%
# Calc mean duration
___
duration_by_artist %>%
# Sort by ascending mean duration
___
duration_by_artist %>%
# Sort by descending mean duration
___