Filtering rows
As well as selecting columns, the other way to extract the important parts of your dataset is to filter the rows. This is achieved using the filter() function. To use filter(), you pass it a tibble and one or more logical conditions. For example, to return only the rows where the values of column x are greater than zero and the values of y equal the values of z, you would use the following.
a_tibble %>%
  filter(x > 0, y == z)
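Note that conditions separated by commas inside filter() are combined with a logical AND, so the call above is equivalent to this single-condition version.
# Comma-separated conditions are ANDed together
a_tibble %>%
  filter(x > 0 & y == z)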
Before you try the exercise, take heed of two warnings. Firstly, don't confuse dplyr's filter() function with the stats package's filter() function. Secondly, sparklyr converts your dplyr code into SQL database code before passing it to Spark. That means that only a limited number of filtering operations are currently supported. For example, you can't filter character rows using regular expressions with code like
a_tibble %>%
  filter(grepl("a regex", x))
The help page for translate_sql() describes the functionality that is available. You are OK to use comparison operators like >, !=, and %in%; arithmetic operators like +, ^, and %%; and logical operators like &, |, and !. Many mathematical functions, such as log(), abs(), round(), and sin(), are also supported.
As before, square bracket indexing does not currently work.
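As a rough sketch, here is a filter built only from operations in that supported list (the columns x, y, and z are placeholders, as in the examples above); every operation used here translates cleanly to SQL.
# Comparison, arithmetic, and logical operators plus common
# maths functions all translate to SQL
a_tibble %>%
  filter(
    x %in% c(1, 2, 3),           # comparison operator
    abs(y) > 0.5,                # supported maths function
    x %% 2 == 0 | round(z) != 1  # arithmetic and logical operators
  )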
Exercise instructions
A Spark connection has been created for you as spark_conn. A tibble attached to the track metadata stored in Spark has been pre-defined as track_metadata_tbl.
- As in the previous exercise, select the artist_name, release, title, and year columns using select().
- Pipe the result of this to filter() to get the tracks from the 1960s.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# track_metadata_tbl has been pre-defined
glimpse(track_metadata_tbl)

# Manipulate the track metadata
track_metadata_tbl %>%
  # Select columns
  ___ %>%
  # Filter rows
  ___
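For reference, one possible way to fill in the blanks, assuming the year column holds numeric years so that "the 1960s" means 1960 through 1969:
track_metadata_tbl %>%
  # Select the four columns named in the instructions
  select(artist_name, release, title, year) %>%
  # Keep only tracks from the 1960s
  filter(year >= 1960, year < 1970)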