Summarizing columns
The mutate() function that you saw in the previous exercise takes columns as inputs, and returns a column. If you are calculating summary statistics such as the mean, maximum, or standard deviation, then you typically want to take columns as inputs but return a single value. This is achieved with the summarize() function.
a_tibble %>%
  summarize(
    mean_x = mean(x),
    sd_x_times_y = sd(x * y)
  )
Note that dplyr has a philosophy (passed on to sparklyr) of always keeping the data in tibbles. So the return value here is a tibble with one row, and one column for each summary statistic that was calculated.
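To see the one-row result concretely, here is a minimal sketch that runs the same summarize() call on a small local tibble. The values in a_tibble are made up for illustration, and the example uses plain dplyr on local data rather than a Spark connection; the call is written the same way when the tibble is backed by Spark.

```r
library(dplyr)

# Hypothetical local data standing in for a_tibble (not from the course)
a_tibble <- tibble(
  x = c(1, 2, 3, 4),
  y = c(2, 4, 6, 8)
)

# Collapse the columns into one value per summary statistic
summary_tbl <- a_tibble %>%
  summarize(
    mean_x = mean(x),
    sd_x_times_y = sd(x * y)
  )

# Still a tibble: one row, and one column for each statistic
summary_tbl
```

Checking nrow(summary_tbl) and ncol(summary_tbl) should confirm one row and two columns, one for each named argument passed to summarize().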
This exercise is part of the course
Introduction to Spark with sparklyr in R
Exercise instructions
A Spark connection has been created for you as spark_conn. A tibble attached to the track metadata stored in Spark has been pre-defined as track_metadata_tbl.
- Select the title and duration fields.
- Pipe the result of this to create a new field, duration_minutes, that contains the track duration in minutes.
- Pipe the result of this to summarize() to calculate the mean duration in minutes, in a field named mean_duration_minutes.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# track_metadata_tbl has been pre-defined
track_metadata_tbl
# Manipulate the track metadata
track_metadata_tbl %>%
  # Select columns
  ___ %>%
  # Mutate columns
  ___ %>%
  # Summarize columns
  ___