Filtering rows
As well as selecting columns, the other way to extract the important parts of your dataset is to filter the rows. This is achieved using the filter() function. To use filter(), you pass it a tibble and one or more logical conditions. For example, to return only the rows where the values of column x are greater than zero and the values of y equal the values of z, you would use the following.
a_tibble %>%
  filter(x > 0, y == z)
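Note that conditions separated by commas inside filter() are combined with a logical AND, so the call above is equivalent to this single-condition version.
# Comma-separated conditions are ANDed together
a_tibble %>%
  filter(x > 0 & y == z)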
Before you try the exercise, take heed of two warnings. Firstly, don't confuse dplyr's filter() function with the stats package's filter() function. Secondly, sparklyr converts your dplyr code into SQL database code before passing it to Spark. That means that only a limited number of filtering operations are currently supported. For example, you can't filter character rows using regular expressions with code like
a_tibble %>%
  filter(grepl("a regex", x))
The help page for translate_sql() describes the functionality that is available. You are OK to use comparison operators like >, !=, and %in%; arithmetic operators like +, ^, and %%; and logical operators like &, |, and !. Many mathematical functions, such as log(), abs(), round(), and sin(), are also supported.
As before, square bracket indexing does not currently work.
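As a rough sketch, here is a filter built only from operations in that supported list (the columns x, y, and z are placeholders, as in the examples above); every operation used here translates cleanly to SQL.
# Comparison, arithmetic, and logical operators plus common
# maths functions all translate to SQL
a_tibble %>%
  filter(
    x %in% c(1, 2, 3),           # comparison operator
    abs(y) > 0.5,                # supported maths function
    x %% 2 == 0 | round(z) != 1  # arithmetic and logical operators
  )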
Exercise instructions
A Spark connection has been created for you as spark_conn. A tibble attached to the track metadata stored in Spark has been pre-defined as track_metadata_tbl.
- As in the previous exercise, select the artist_name, release, title, and year columns using select().
- Pipe the result of this to filter() to get the tracks from the 1960s.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# track_metadata_tbl has been pre-defined
glimpse(track_metadata_tbl)

# Manipulate the track metadata
track_metadata_tbl %>%
  # Select columns
  ___ %>%
  # Filter rows
  ___
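For reference, one possible way to fill in the blanks, assuming the year column holds numeric years so that "the 1960s" means 1960 through 1969:
track_metadata_tbl %>%
  # Select the four columns named in the instructions
  select(artist_name, release, title, year) %>%
  # Keep only tracks from the 1960s
  filter(year >= 1960, year < 1970)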