Summarizing columns

The mutate() function that you saw in the previous exercise takes columns as inputs and returns a column. Summary statistics such as the mean, maximum, or standard deviation work differently: they take columns as inputs but return a single value. This is achieved with the summarize() function.

a_tibble %>%
  summarize(
    mean_x       = mean(x),
    sd_x_times_y = sd(x * y)
  )

Note that dplyr has a philosophy (passed on to sparklyr) of always keeping the data in tibbles. So the return value here is a tibble with one row, and one column for each summary statistic that was calculated.
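
To see the one-row result concretely, here is a minimal sketch that runs the same summarize() call on a small local tibble. The values in a_tibble are made up for illustration, and the sketch uses plain dplyr on local data rather than a Spark connection.

```r
library(dplyr)

# Hypothetical local data standing in for a_tibble (values are invented)
a_tibble <- tibble(
  x = c(1, 2, 3, 4),
  y = c(2, 2, 2, 2)
)

# Each argument to summarize() collapses whole columns into a single value
result <- a_tibble %>%
  summarize(
    mean_x       = mean(x),
    sd_x_times_y = sd(x * y)
  )

# The result is still a tibble: one row, one column per summary statistic
dim(result)
```

Here dim(result) reports one row and two columns, one column for each named summary.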

This exercise is part of the course

Introduction to Spark with sparklyr in R

Exercise instructions

A Spark connection has been created for you as spark_conn. A tibble attached to the track metadata stored in Spark has been pre-defined as track_metadata_tbl.

  • Select the title and duration fields.
  • Pipe the result of this to create a new field, duration_minutes, that contains the track duration in minutes.
  • Pipe the result of this to summarize() to calculate the mean duration in minutes, in a field named mean_duration_minutes.
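
The shape of this select, mutate, summarize chain can be sketched on a small local tibble. Everything in tracks is hypothetical, and the division by 60 assumes that duration is stored in seconds, which the instructions do not state. The sketch runs on local data with plain dplyr, not on track_metadata_tbl.

```r
library(dplyr)

# Hypothetical local stand-in for the track metadata (values are invented)
tracks <- tibble(
  title       = c("Song A", "Song B", "Song C"),
  artist_name = c("Artist 1", "Artist 2", "Artist 3"),
  duration    = c(120, 180, 240)  # assumed to be in seconds
)

mean_duration <- tracks %>%
  # Keep only the columns needed for the calculation
  select(title, duration) %>%
  # Add a column with the duration converted to minutes (assumes seconds)
  mutate(duration_minutes = duration / 60) %>%
  # Collapse that column into a single summary value
  summarize(mean_duration_minutes = mean(duration_minutes))

mean_duration
```

With these invented values the durations are 2, 3, and 4 minutes, so mean_duration_minutes is 3.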

Hands-on interactive exercise

Try this exercise by completing the sample code.

# track_metadata_tbl has been pre-defined
track_metadata_tbl

# Manipulate the track metadata
track_metadata_tbl %>%
  # Select columns
  ___ %>%
  # Mutate columns
  ___ %>%
  # Summarize columns
  ___