Get startedGet started for free

Groups: great for music, great for data

A common analysis problem is how to calculate summary statistics for each group of data. For example, you might want to know your sales revenues by month, or by region. In R, the process of splitting up your data into groups, applying a summary statistic on each group, and combining the results into a single data structure, is known as "split-apply-combine". The concept is much older though: SQL has had the GROUP BY statement for decades. The term "map-reduce" is a similar concept, where "map" is very roughly analogous to the "split" and "apply" steps, and "reducing" is "combining". The dplyr/sparklyr approach is to use group_by() before you mutate() or summarize(). It takes the unquoted names of columns to group by. For example, to calculate the mean of column x, for each combination of values in columns grp1 and grp2, you would write the following.

a_tibble %>%
  group_by(grp1, grp2) %>%
  summarize(mean_x = mean(x))

Note that the columns passed to group_by() should typically be categorical variables. For example, if you wanted to calculate the average weight of people relative to their height, it doesn't make sense to group by height, since everyone's height is unique. You could, however, use cut() to convert the heights into different categories, and calculate the mean weight for each category.

This exercise is part of the course

Introduction to Spark with sparklyr in R

View Course

Exercise instructions

A Spark connection has been created for you as spark_conn. A tibble attached to the track metadata stored in Spark has been pre-defined as track_metadata_tbl.

  • Group the contents of track_metadata by artist_name, then:
    • Summarize the groupwise mean of duration as a new column, mean_duration.
    • Assign the results to duration_by_artist.
  • Find the artists with the shortest songs by arranging the rows in ascending order of mean_duration.
  • Likewise, find those with the longest songs by arranging in descending order of mean_duration.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# track_metadata_tbl has been pre-defined
track_metadata_tbl

duration_by_artist <- track_metadata_tbl %>%
  # Group by artist
  ___ %>%
  # Calc mean duration
  ___

duration_by_artist %>%
  # Sort by ascending mean duration
  ___

duration_by_artist %>%
  # Sort by descending mean duration
  ___
Edit and Run Code