Mutating columns
It may surprise you, but not all datasets start out perfectly clean! Often you have to fix values, or create new columns derived from your existing data. The process of changing or adding columns is called mutation in dplyr terminology, and is performed using mutate(). This function takes a tibble, plus named arguments describing how to update columns. The name of each argument is the name of the column to change or add, and its value is an expression explaining how to compute it. For example, given a tibble with columns x and y, the following code would update x and create a new column z.
a_tibble %>%
  mutate(
    x = x + y,
    z = log(x)
  )
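A point worth noting: mutate() evaluates its arguments in order, so later expressions see earlier updates. In the snippet above, z = log(x) uses the already-updated x. A minimal local illustration (the tibble values here are invented for demonstration; this runs on an ordinary in-memory tibble, not Spark):

```r
library(dplyr)

# A small example tibble (values chosen arbitrarily for illustration)
a_tibble <- tibble(x = c(1, 2), y = c(3, 4))

a_tibble %>%
  mutate(
    x = x + y,   # x is updated first: 4, 6
    z = log(x)   # log() sees the updated x, not the original
  )
```

The same sequential-evaluation rule applies when the tibble is backed by Spark, since sparklyr translates the mutate() call to SQL expressions evaluated in the same order.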
In case you hadn't got the message already that base-R functions don't work with Spark tibbles: you can't use within() or transform() for this purpose.
This exercise is part of the course Introduction to Spark with sparklyr in R.
Exercise instructions
A Spark connection has been created for you as spark_conn. A tibble attached to the track metadata stored in Spark has been pre-defined as track_metadata_tbl.
- Select the title and duration fields. Note that the durations are in seconds.
- Pipe the result of this to mutate() to create a new field, duration_minutes, that contains the track duration in minutes.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# track_metadata_tbl has been pre-defined
track_metadata_tbl

# Manipulate the track metadata
track_metadata_tbl %>%
  # Select columns
  ___ %>%
  # Mutate columns
  ___
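One way to complete the scaffold, sketched here for reference: select the two fields, then divide the duration (in seconds) by 60. This assumes the pre-defined track_metadata_tbl with title and duration columns from the exercise setup; it needs a live Spark connection to actually run.

```r
library(sparklyr)
library(dplyr)

# Assumes track_metadata_tbl is the pre-defined Spark tibble
track_metadata_tbl %>%
  # Select columns
  select(title, duration) %>%
  # Mutate columns: durations are in seconds, so divide by 60
  mutate(duration_minutes = duration / 60)
```

Because track_metadata_tbl is a Spark tibble, both the select() and the mutate() are translated to SQL and executed by Spark, not in your R session.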