Exercise

Selecting columns

The easiest way to manipulate data frames stored in Spark is to use dplyr syntax. Manipulating data frames with dplyr is covered in detail in the Data Manipulation in R with dplyr and Joining Data in R with dplyr courses, but you'll spend the next chapter and a half covering all the important points.

dplyr has five main verbs that you can apply to a data frame: select() to pick columns, filter() to pick rows, arrange() to reorder rows, mutate() to change or add columns, and summarize() to calculate summary statistics.
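As a quick illustration, here is a minimal sketch chaining all five verbs on a Spark tibble. It assumes a hypothetical remote table `flights_tbl` with the columns shown; the names are illustrative only, not part of this exercise.

```r
library(sparklyr)
library(dplyr)

# Hedged sketch: `flights_tbl` and its columns are assumed for illustration
flights_tbl %>%
  select(origin, dest, dep_delay) %>%        # keep three columns
  filter(dep_delay > 0) %>%                  # keep only delayed flights
  arrange(desc(dep_delay)) %>%               # sort, longest delay first
  mutate(dep_delay_hr = dep_delay / 60) %>%  # add a derived column
  summarize(mean_delay_hr = mean(dep_delay_hr))  # summary statistic
```

Because `flights_tbl` is a remote Spark tibble, each verb is translated to SQL and executed in Spark rather than in R.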

Let's start with selecting columns. This is done by calling select() on a tibble, followed by the unquoted names of the columns you want to keep. dplyr functions are conventionally used with magrittr's pipe operator, %>%. To select the x, y, and z columns, you would write the following.

a_tibble %>%
  select(x, y, z)

Note that square bracket indexing is not currently supported in sparklyr, so the following throws an error:

a_tibble[, c("x", "y", "z")]
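If you want to run that line without stopping your script, you can wrap it in tryCatch(), as the exercise below does. A minimal sketch, assuming `a_tibble` is a remote Spark tibble:

```r
# Catch the error raised by square bracket indexing on a Spark tibble
tryCatch({
  a_tibble[, c("x", "y", "z")]
}, error = function(e) {
  message("Indexing failed: ", conditionMessage(e))
})
```

tryCatch() runs the expression, and if an error is signaled, passes the condition object to the error handler instead of halting execution.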
Instructions

A Spark connection has been created for you as spark_conn. A tibble attached to the track metadata stored in Spark has been pre-defined as track_metadata_tbl.

  • Select the artist_name, release, title, and year columns using select().
  • Try to do the same thing using square bracket indexing. Spoiler! This code throws an error, so it is wrapped in a call to tryCatch().
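A sketch of what the two steps might look like, assuming `track_metadata_tbl` is defined in the session and has the columns named above:

```r
# Select four columns with dplyr syntax (works on Spark tibbles)
track_metadata_tbl %>%
  select(artist_name, release, title, year)

# Square bracket indexing is not supported on Spark tibbles,
# so wrap it in tryCatch() to inspect the error without stopping
tryCatch({
  track_metadata_tbl[, c("artist_name", "release", "title", "year")]
}, error = function(e) {
  message("Error: ", conditionMessage(e))
})
```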