Mutating columns
It may surprise you, but not all datasets start out perfectly clean! Often you have to fix values, or create new columns derived from your existing data. The process of changing or adding columns is called mutation in dplyr terminology, and is performed using mutate(). This function takes a tibble, plus named arguments describing how to update columns. The name of each argument is the name of the column to change or add, and its value is an expression explaining how to compute it. For example, given a tibble with columns x and y, the following code would update x and create a new column z.
a_tibble %>%
  mutate(
    x = x + y,
    z = log(x)
  )
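A point worth noting: mutate() evaluates its arguments in order, so later expressions see earlier updates. In the snippet above, z = log(x) uses the already-updated x. A minimal local illustration (the tibble values here are invented for demonstration; this runs on an ordinary in-memory tibble, not Spark):

```r
library(dplyr)

# A small example tibble (values chosen arbitrarily for illustration)
a_tibble <- tibble(x = c(1, 2), y = c(3, 4))

a_tibble %>%
  mutate(
    x = x + y,   # x is updated first: 4, 6
    z = log(x)   # log() sees the updated x, not the original
  )
```

The same sequential-evaluation rule applies when the tibble is backed by Spark, since sparklyr translates the mutate() call to SQL expressions evaluated in the same order.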
In case you hadn't got the message already that base-R functions don't work with Spark tibbles: you can't use within() or transform() for this purpose.
This exercise is part of the course Introduction to Spark with sparklyr in R.
Exercise instructions
A Spark connection has been created for you as spark_conn. A tibble attached to the track metadata stored in Spark has been pre-defined as track_metadata_tbl.
- Select the title and duration fields. Note that the durations are in seconds.
- Pipe the result of this to mutate() to create a new field, duration_minutes, that contains the track duration in minutes.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# track_metadata_tbl has been pre-defined
track_metadata_tbl

# Manipulate the track metadata
track_metadata_tbl %>%
  # Select columns
  ___ %>%
  # Mutate columns
  ___
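One way to complete the scaffold, sketched here for reference: select the two fields, then divide the duration (in seconds) by 60. This assumes the pre-defined track_metadata_tbl with title and duration columns from the exercise setup; it needs a live Spark connection to actually run.

```r
library(sparklyr)
library(dplyr)

# Assumes track_metadata_tbl is the pre-defined Spark tibble
track_metadata_tbl %>%
  # Select columns
  select(title, duration) %>%
  # Mutate columns: durations are in seconds, so divide by 60
  mutate(duration_minutes = duration / 60)
```

Because track_metadata_tbl is a Spark tibble, both the select() and the mutate() are translated to SQL and executed by Spark, not in your R session.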