Selecting columns
The easiest way to manipulate data frames stored in Spark is to use dplyr syntax. Manipulating data frames with dplyr is covered in detail in the Data Manipulation with dplyr and Joining Data with dplyr courses, but you'll spend the next chapter and a half covering all the important points.
dplyr has five main actions that you can perform on a data frame: you can select columns, filter rows, arrange the order of rows, change or add columns, and calculate summary statistics.
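To make those verbs concrete, here is a minimal sketch of each one on a hypothetical local tibble; my_tbl and its columns x and y are assumptions for illustration, not objects from this course.

library(dplyr)

my_tbl %>% select(x, y)                  # select columns
my_tbl %>% filter(x > 0)                 # filter rows
my_tbl %>% arrange(desc(y))              # arrange the order of rows
my_tbl %>% mutate(z = x + y)             # change or add columns
my_tbl %>% summarize(mean_x = mean(x))   # calculate summary statistics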
Let's start with selecting columns. This is done by calling select() with a tibble, followed by the unquoted names of the columns you want to keep. dplyr functions are conventionally used with magrittr's pipe operator, %>%. To select the x, y, and z columns, you would write the following.
a_tibble %>%
  select(x, y, z)
Note that square bracket indexing is not currently supported in sparklyr, so you cannot do the following.
a_tibble[, c("x", "y", "z")]
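If you want to see that failure without stopping your session, one option is to wrap the indexing attempt in base R's tryCatch() so the error is printed rather than raised; a_tibble here stands in for any Spark tibble.

tryCatch({
    a_tibble[, c("x", "y", "z")]
  },
  error = print
)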
Exercise instructions
A Spark connection has been created for you as spark_conn. A tibble attached to the track metadata stored in Spark has been pre-defined as track_metadata_tbl.
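For context, a setup along the following lines could produce those two objects; the local master and the track_metadata table name are assumptions here, not details confirmed by the course.

library(sparklyr)
library(dplyr)

# Connect to a local Spark instance (assumed master)
spark_conn <- spark_connect(master = "local")

# Attach a tibble to a table already stored in Spark (assumed table name)
track_metadata_tbl <- tbl(spark_conn, "track_metadata")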
- Select the artist_name, release, title, and year columns using select().
- Try to do the same thing using square bracket indexing. Spoiler! This code throws an error, so it is wrapped in a call to tryCatch(). (A possible fill-in is sketched after the sample code below.)
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# track_metadata_tbl has been pre-defined
track_metadata_tbl

# Manipulate the track metadata
track_metadata_tbl %>%
  # Select columns
  ___

# Try to select columns using [ ]
tryCatch({
    # Selection code here
    ___
  },
  error = print
)
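If you get stuck, one plausible way to fill in the blanks is sketched below; it assumes the four columns are stored under exactly the names given in the instructions.

# Select the four columns with select()
track_metadata_tbl %>%
  select(artist_name, release, title, year)

# Square bracket indexing is not supported in sparklyr, so this errors
tryCatch({
    track_metadata_tbl[, c("artist_name", "release", "title", "year")]
  },
  error = print
)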