The connect-work-disconnect pattern
Working with sparklyr is very much like working with dplyr when you have data inside a database. In fact, sparklyr converts your R code into SQL before passing it to Spark.
The typical workflow has three steps:
- Connect to Spark using spark_connect().
- Do some work.
- Close the connection to Spark using spark_disconnect().
In this exercise, you'll do the simplest possible piece of work: returning the version of Spark that is running, using spark_version().
spark_connect() takes a URL that gives the location of Spark. For a local cluster (as you are running here), the URL should be "local". For a remote cluster (on another machine, typically a high-performance server), the connection string will be a URL and port to connect on.
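For instance, here is a minimal sketch of the two forms. The remote host and port below are hypothetical, shown only to illustrate the shape of the connection string.

# Load sparklyr
library(sparklyr)

# Local cluster, as in this exercise
sc <- spark_connect(master = "local")

# Remote standalone cluster: pass its URL and port instead (hypothetical host)
# sc <- spark_connect(master = "spark://spark.example.com:7077")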
spark_version() and spark_disconnect() both take the Spark connection as their only argument.
One word of warning. Connecting to a cluster takes several seconds, so it is impractical to regularly connect and disconnect. While you need to reconnect for each DataCamp exercise, when you incorporate sparklyr into your own workflow, it is usually best to keep the connection open for the whole time that you want to work with Spark.
This exercise is part of the course Introduction to Spark with sparklyr in R.
Exercise instructions
- Load the sparklyr package with library().
- Connect to Spark by calling spark_connect(), with argument master = "local". Assign the result to spark_conn.
- Get the Spark version using spark_version(), with argument sc = spark_conn.
- Disconnect from Spark using spark_disconnect(), with argument sc = spark_conn.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Load sparklyr
___
# Connect to your Spark cluster
spark_conn <- ___
# Print the version of Spark
___
# Disconnect from Spark
___
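Following the instructions above, one way to complete the sample code looks like this:

# Load sparklyr
library(sparklyr)

# Connect to your Spark cluster
spark_conn <- spark_connect(master = "local")

# Print the version of Spark
spark_version(sc = spark_conn)

# Disconnect from Spark
spark_disconnect(sc = spark_conn)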