
Filtering rows

As well as selecting columns, the other way to extract important parts of your dataset is to filter the rows. This is achieved using the filter() function. To use filter(), you pass it a tibble and some logical conditions. For example, to return only the rows where the values of column x are greater than zero and the values of y equal the values of z, you would use the following.

a_tibble %>%
  filter(x > 0, y == z)

Before you try the exercise, take heed of two warnings. First, don't confuse dplyr's filter() function with the stats package's filter() function. Second, sparklyr converts your dplyr code into SQL before passing it to Spark, which means that only a limited set of filtering operations is currently supported. For example, you can't filter character rows using regular expressions with code like

a_tibble %>%
  filter(grepl("a regex", x))

The help page for translate_sql() describes the functionality that is available. You are OK to use comparison operators like >, !=, and %in%; arithmetic operators like +, ^, and %%; and logical operators like &, |, and !. Many mathematical functions, such as log(), abs(), round(), and sin(), are also supported.
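As a sketch of what does translate, the following filter uses only operators from the supported list. It assumes the track_metadata_tbl tibble defined in the exercise below, and that its year column is numeric:

track_metadata_tbl %>%
  filter(
    year >= 1960, year < 1970,  # comparison operators
    year %% 2 == 0,             # arithmetic operator: modulo
    !(year == 1963)             # logical negation
  )

Each condition here maps directly onto a SQL WHERE clause, which is why the translation works.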

As before, square bracket indexing does not currently work.
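Where you might reach for square brackets on a local data frame, use filter() instead. A minimal sketch with the same a_tibble as above:

# Not supported on a Spark tibble:
# a_tibble[a_tibble$x > 0, ]

# Equivalent filter() call, which sparklyr can translate to SQL:
a_tibble %>%
  filter(x > 0)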

This exercise is part of the course Introduction to Spark with sparklyr in R.

Exercise instructions

A Spark connection has been created for you as spark_conn. A tibble attached to the track metadata stored in Spark has been pre-defined as track_metadata_tbl.

  • As in the previous exercise, select the artist_name, release, title, and year using select().
  • Pipe the result of this to filter() to get the tracks from the 1960s.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# track_metadata_tbl has been pre-defined
glimpse(track_metadata_tbl)

# Manipulate the track metadata
track_metadata_tbl %>%
  # Select columns
  ___ %>%
  # Filter rows
  ___
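One possible completion of the sample code above, assuming "tracks from the 1960s" means years 1960 through 1969:

track_metadata_tbl %>%
  # Select columns
  select(artist_name, release, title, year) %>%
  # Filter rows
  filter(year >= 1960, year < 1970)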