The connect-work-disconnect pattern
Working with sparklyr is very much like working with dplyr when you have data inside a database. In fact, sparklyr converts your R code into SQL before passing it to Spark.
The typical workflow has three steps:
- Connect to Spark using spark_connect().
- Do some work.
- Close the connection to Spark using spark_disconnect().
In this exercise, you'll do the simplest possible piece of work: returning the version of Spark that is running, using spark_version().
spark_connect() takes a URL that gives the location of Spark. For a local cluster (as you are running here), the URL should be "local". For a remote cluster (on another machine, typically a high-performance server), the connection string will be a URL and port to connect on.
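For instance, here is a minimal sketch of the two forms. The remote host and port below are hypothetical, shown only to illustrate the shape of the connection string.

# Load sparklyr
library(sparklyr)

# Local cluster, as in this exercise
sc <- spark_connect(master = "local")

# Remote standalone cluster: pass its URL and port instead (hypothetical host)
# sc <- spark_connect(master = "spark://spark.example.com:7077")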
spark_version() and spark_disconnect() both take the Spark connection as their only argument.
One word of warning. Connecting to a cluster takes several seconds, so it is impractical to regularly connect and disconnect. While you need to reconnect for each DataCamp exercise, when you incorporate sparklyr into your own workflow, it is usually best to keep the connection open for the whole time that you want to work with Spark.
This exercise is part of the course Introduction to Spark with sparklyr in R.
Exercise instructions
- Load the sparklyr package with library().
- Connect to Spark by calling spark_connect(), with argument master = "local". Assign the result to spark_conn.
- Get the Spark version using spark_version(), with argument sc = spark_conn.
- Disconnect from Spark using spark_disconnect(), with argument sc = spark_conn.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Load sparklyr
___
# Connect to your Spark cluster
spark_conn <- ___
# Print the version of Spark
___
# Disconnect from Spark
___
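Following the instructions above, one way to complete the sample code looks like this:

# Load sparklyr
library(sparklyr)

# Connect to your Spark cluster
spark_conn <- spark_connect(master = "local")

# Print the version of Spark
spark_version(sc = spark_conn)

# Disconnect from Spark
spark_disconnect(sc = spark_conn)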