Get startedGet started for free

Advanced Selection II: The SQL

As previously mentioned, when you use the dplyr interface, sparklyr converts your code into SQL before passing it to Spark. Most of the time, this is what you want. However, you can also write raw SQL to accomplish the same task. Most of the time, this is a silly idea since the code is harder to write and harder to debug. However, if you want your code to be portable – that is, used outside of R as well – then it may be useful. For example, a fairly common workflow is to use sparklyr to experiment with data processing, then switch to raw SQL in a production environment. By writing raw SQL to begin with, you can just copy and paste your queries when you move to production.

SQL queries are written as strings, and passed to dbGetQuery() from the DBI package. The pattern is as follows.

query <- "SELECT col1, col2 FROM some_data WHERE some_condition"
a_data.frame <- dbGetQuery(spark_conn, query)

Note that unlike the dplyr code you've written, dbGetQuery() will always execute the query and return the results to R immediately. If you want to delay returning the data, you can use dbSendQuery() to execute the query, then dbFetch() to return the results. That's more advanced usage, not covered here. Also note that DBI functions return data.frames rather than tibbles, since DBI is a lower-level package.

If you want to learn more about writing SQL code, take the Introduction to SQL course.

This exercise is part of the course

Introduction to Spark with sparklyr in R

View Course

Exercise instructions

A Spark connection has been created for you as spark_conn. A tibble attached to the track metadata stored in Spark has been pre-defined as track_metadata_tbl, and can be accessed in SQL queries at track_metadata.

  • Complete the query to select all columns from the track_metadata Spark data frame where the year is less than 1935 and the duration is greater than 300 seconds.
  • Call dbGetQuery() to execute the query, assigning the results to results, then view the output.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Write SQL query
query <- "SELECT ___ FROM ___ WHERE ___ AND ___"

# Run the query
(results <- ___(___, ___))
Edit and Run Code