Selecting columns

The easiest way to manipulate data frames stored in Spark is to use dplyr syntax. Manipulating data frames using the dplyr syntax is covered in detail in the Data Manipulation with dplyr and Joining Data with dplyr courses, but you'll spend the next chapter and a half covering all the important points.

dplyr has five main actions that you can perform on a data frame. You can select columns, filter rows, arrange the order of rows, change columns or add new columns, and calculate summary statistics.

Let's start with selecting columns. This is done by calling select(), with a tibble, followed by the unquoted names of the columns you want to keep. dplyr functions are conventionally used with magrittr's pipe operator, %>%. To select the x, y, and z columns, you would write the following.

a_tibble %>%
  select(x, y, z)

Note that square bracket indexing is not currently supported in sparklyr. So you cannot do

a_tibble[, c("x", "y", "z")]

Este ejercicio forma parte del curso

Introduction to Spark with sparklyr in R

Ver curso

Instrucciones del ejercicio

A Spark connection has been created for you as spark_conn. A tibble attached to the track metadata stored in Spark has been pre-defined as track_metadata_tbl.

Select the artist_name, release, title, and year using select().
Try to do the same thing using square bracket indexing. Spoiler! This code throws an error, so it is wrapped in a call to tryCatch().

Ejercicio interactivo práctico

Prueba este ejercicio y completa el código de muestra.

# track_metadata_tbl has been pre-defined
track_metadata_tbl

# Manipulate the track metadata
track_metadata_tbl %>%
  # Select columns
  ___

# Try to select columns using [ ]
tryCatch({
    # Selection code here
    ___
  },
  error = print
)

Editar y ejecutar código