Creating a SparkSession

In this exercise, you'll spin up a local Spark cluster using all available cores. The cluster will be accessible via a SparkSession object.

The SparkSession class has a builder attribute, which is an instance of the Builder class. The Builder class exposes three important methods that let you:

specify the location of the master node;
name the application (optional); and
retrieve an existing SparkSession or, if there is none, create a new one.
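
The three builder steps above can be sketched as follows. This is a minimal illustration, not the course's reference solution: the helper names `local_master_url` and `create_local_session` are hypothetical, and the sketch assumes PySpark and a Java runtime are installed when the session is actually created. The master URL forms `local[*]` (one worker thread per logical core) and `local[K]` (exactly K worker threads) are standard Spark conventions.

```python
def local_master_url(cores=None):
    """Build a master URL for a local cluster (hypothetical helper).

    None -> "local[*]"  (one worker thread per logical core)
    k    -> "local[k]"  (exactly k worker threads)
    """
    if cores is None:
        return "local[*]"
    if cores < 1:
        raise ValueError("cores must be a positive integer")
    return f"local[{cores}]"


def create_local_session(app_name="test", cores=None):
    """Return the existing SparkSession, or create a new local one."""
    # Imported lazily so local_master_url() works without PySpark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .master(local_master_url(cores))  # location of the master node
        .appName(app_name)                # optional application name
        .getOrCreate()                    # reuse a session or create one
    )
```

Calling `create_local_session("test")` would then give a session backed by all available cores, while `create_local_session("test", cores=2)` would limit it to two worker threads.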

A SparkSession object has a version attribute, which gives the version of Spark running on the cluster. Note: The version of the installed PySpark package can also be accessed via the __version__ attribute on the pyspark module.
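
Because the Spark version on the platform may differ from the one in the lessons, it can help to compare versions programmatically. The sketch below is one possible approach, with hypothetical helper names (`major_minor`, `versions_match`); it assumes version strings begin with dot-separated numeric major and minor components, such as `3.5.1`.

```python
def major_minor(version):
    """Return (major, minor) as integers from a string such as '3.5.1'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def versions_match(session):
    """Check whether the session and the pyspark module agree on major.minor."""
    # Imported lazily; this function requires PySpark to be installed.
    import pyspark

    return major_minor(session.version) == major_minor(pyspark.__version__)
```

Here `session` is assumed to be a SparkSession object, whose `version` attribute is a plain string.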

Find out more about SparkSession in the PySpark API documentation.

Once you have finished with the cluster, it's a good idea to shut it down. This frees its resources and makes them available to other processes.
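
One way to make sure the cluster is always shut down, even if the work in between raises an error, is a `try`/`finally` block. The sketch below assumes only that the session object has a `stop()` method, as SparkSession does; the wrapper name `run_and_stop` is hypothetical.

```python
def run_and_stop(session, work):
    """Run work(session), then stop the session whether or not work fails."""
    try:
        return work(session)
    finally:
        # Release the cluster's resources for other processes.
        session.stop()
```

For example, `run_and_stop(spark, lambda s: s.version)` would return the Spark version and then terminate the session.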

Notes:

You might find it useful to review the slides from the lessons in the Slides panel next to the IPython Shell.
The version of Spark in the exercise is not the same as in the lessons. The exercise platform has been updated to a more recent version of Spark.

This exercise is part of the course

Machine Learning with PySpark


Exercise instructions

Import the SparkSession class from pyspark.sql.
Create a SparkSession object connected to a local cluster. Use all available cores. Name the application 'test'.
Use the version attribute on the SparkSession object to retrieve the version of Spark running on the cluster. Note: The version might be different to the one that's used in the presentation (it gets updated from time to time).
Shut down the cluster.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Import the SparkSession class
from ____ import ____

# Create SparkSession object
spark = SparkSession.builder \
                    .master(____) \
                    .____(____) \
                    .____()

# What version of Spark?
print(spark.____)

# Terminate the cluster
spark.____()
