1. Learn
  2. /
  3. Courses
  4. /
  5. Introduction to Spark with sparklyr in R

Exercise

Groups: great for music, great for data

A common analysis problem is how to calculate summary statistics for each group of data. For example, you might want to know your sales revenues by month, or by region. In R, the process of splitting up your data into groups, applying a summary statistic on each group, and combining the results into a single data structure, is known as "split-apply-combine". The concept is much older though: SQL has had the GROUP BY statement for decades. The term "map-reduce" is a similar concept, where "map" is very roughly analogous to the "split" and "apply" steps, and "reducing" is "combining". The dplyr/sparklyr approach is to use group_by() before you mutate() or summarize(). It takes the unquoted names of columns to group by. For example, to calculate the mean of column x, for each combination of values in columns grp1 and grp2, you would write the following.

a_tibble %>%
  group_by(grp1, grp2) %>%
  summarize(mean_x = mean(x))

Note that the columns passed to group_by() should typically be categorical variables. For example, if you wanted to calculate the average weight of people relative to their height, it doesn't make sense to group by height, since everyone's height is unique. You could, however, use cut() to convert the heights into different categories, and calculate the mean weight for each category.

Instructions

100 XP

A Spark connection has been created for you as spark_conn. A tibble attached to the track metadata stored in Spark has been pre-defined as track_metadata_tbl.

  • Group the contents of track_metadata by artist_name, then:
    • Summarize the groupwise mean of duration as a new column, mean_duration.
    • Assign the results to duration_by_artist.
  • Find the artists with the shortest songs by arranging the rows in ascending order of mean_duration.
  • Likewise, find those with the longest songs by arranging in descending order of mean_duration.