Advanced Selection II: The SQL
As previously mentioned, when you use the dplyr
interface, sparklyr
converts your code into SQL before passing it to Spark. Most of the time, this is what you want. However, you can also write raw SQL to accomplish the same task. Most of the time, this is a silly idea since the code is harder to write and harder to debug. However, if you want your code to be portable – that is, used outside of R as well – then it may be useful. For example, a fairly common workflow is to use sparklyr
to experiment with data processing, then switch to raw SQL in a production environment. By writing raw SQL to begin with, you can just copy and paste your queries when you move to production.
SQL queries are written as strings, and passed to dbGetQuery()
from the DBI package. The pattern is as follows.
query <- "SELECT col1, col2 FROM some_data WHERE some_condition"
a_data.frame <- dbGetQuery(spark_conn, query)
Note that unlike the dplyr
code you've written, dbGetQuery()
will always execute the query and return the results to R immediately. If you want to delay returning the data, you can use dbSendQuery()
to execute the query, then dbFetch()
to return the results. That's more advanced usage, not covered here. Also note that DBI
functions return data.frame
s rather than tibble
s, since DBI
is a lower-level package.
If you want to learn more about writing SQL code, take the Introduction to SQL course.
This exercise is part of the course
Introduction to Spark with sparklyr in R
Exercise instructions
A Spark connection has been created for you as spark_conn
. A tibble attached to the track metadata stored in Spark has been pre-defined as track_metadata_tbl
, and can be accessed in SQL queries at track_metadata
.
- Complete the query to select all columns from the
track_metadata
Spark data frame where theyear
is less than 1935 and theduration
is greater than 300 seconds. - Call
dbGetQuery()
to execute the query, assigning the results toresults
, then view the output.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Write SQL query
query <- "SELECT ___ FROM ___ WHERE ___ AND ___"
# Run the query
(results <- ___(___, ___))