
Selecting columns

The easiest way to manipulate data frames stored in Spark is to use dplyr syntax. Manipulating data frames using the dplyr syntax is covered in detail in the Data Manipulation with dplyr and Joining Data with dplyr courses, but you'll spend the next chapter and a half covering all the important points.

dplyr has five main verbs that you can apply to a data frame: select() to choose columns, filter() to keep rows, arrange() to sort rows, mutate() to change columns or add new ones, and summarize() to calculate summary statistics.
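
As a quick preview, here is a minimal sketch of each verb, applied to a hypothetical local tibble a_tibble with numeric columns x, y, and z:

library(dplyr)

a_tibble %>% select(x, y)                 # keep only the x and y columns
a_tibble %>% filter(x > 0)                # keep rows where x is positive
a_tibble %>% arrange(desc(y))             # sort rows by y, descending
a_tibble %>% mutate(z2 = z * 2)           # add a new column derived from z
a_tibble %>% summarize(mean_x = mean(x))  # collapse to one summary row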

Let's start with selecting columns. This is done by calling select() on a tibble, passing the unquoted names of the columns you want to keep. dplyr functions are conventionally chained with magrittr's pipe operator, %>%, which passes the value on its left as the first argument of the function on its right. To select the x, y, and z columns, you would write the following.

a_tibble %>%
  select(x, y, z)
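
Without the pipe, the equivalent call simply passes the tibble as the first argument:

select(a_tibble, x, y, z)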

Note that square bracket indexing is not currently supported in sparklyr, so you cannot select columns like this:

a_tibble[, c("x", "y", "z")]

This exercise is part of the course Introduction to Spark with sparklyr in R.

Exercise instructions

A Spark connection has been created for you as spark_conn. A tibble attached to the track metadata stored in Spark has been pre-defined as track_metadata_tbl.

  • Select the artist_name, release, title, and year columns using select().
  • Try to do the same thing using square bracket indexing. Spoiler! This code throws an error, so it is wrapped in a call to tryCatch().

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# track_metadata_tbl has been pre-defined
track_metadata_tbl

# Manipulate the track metadata
track_metadata_tbl %>%
  # Select columns
  ___

# Try to select columns using [ ]
tryCatch({
    # Selection code here
    ___
  },
  error = print
)
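
For reference, here is one way to complete the sample code; the second block is expected to print an error message rather than return rows.

# track_metadata_tbl has been pre-defined
track_metadata_tbl

# Manipulate the track metadata
track_metadata_tbl %>%
  # Select the four requested columns
  select(artist_name, release, title, year)

# Try to select columns using [ ]; this fails on a Spark tibble
tryCatch({
    track_metadata_tbl[, c("artist_name", "release", "title", "year")]
  },
  error = print
)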