Creating a SparkSession
In this exercise, you'll spin up a local Spark cluster using all available cores. The cluster will be accessible via a SparkSession object.
The SparkSession class has a builder attribute, which is an instance of the Builder class. The Builder class exposes three important methods that let you:
- specify the location of the master node;
- name the application (optional); and
- retrieve an existing SparkSession or, if there is none, create a new one.
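The three builder methods above can be sketched roughly as follows. This is a minimal illustration, not the exercise platform's reference solution: the helper names local_master_url and get_or_create_session are hypothetical, while master(), appName() and getOrCreate() are the Builder methods being described. The master URL 'local[*]' asks for a local cluster with all available cores, and 'local[N]' asks for N cores.

```python
from typing import Optional


def local_master_url(cores: Optional[int] = None) -> str:
    """Build a Spark master URL for a local cluster (hypothetical helper).

    None means "use all available cores", which Spark spells 'local[*]'.
    """
    if cores is None:
        return "local[*]"
    if cores < 1:
        raise ValueError("cores must be a positive integer")
    return f"local[{cores}]"


def get_or_create_session(app_name: str = "test", cores: Optional[int] = None):
    """Chain the three builder methods (hypothetical wrapper)."""
    # Imported here so the URL helper above stays usable without PySpark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .master(local_master_url(cores))  # location of the master node
        .appName(app_name)                # optional application name
        .getOrCreate()                    # reuse an existing session or create one
    )
```

Calling get_or_create_session() with no arguments would request a local session named 'test' on all cores. It needs a working PySpark and Java installation to run.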
The SparkSession class has a version attribute which gives the version of Spark. Note: The version can also be accessed via the __version__ attribute on the pyspark module.
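Because the Spark version on the platform may differ from the one in the lessons, it can help to compare versions numerically rather than as strings. The sketch below assumes the version string starts with dot-separated major and minor numbers (for example '3.5.1'). The function name major_minor is hypothetical and is not part of PySpark.

```python
from typing import Tuple


def major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from a version string such as '3.5.1'."""
    # Keep only the first two dot-separated fields and convert them to integers.
    major, minor = version.split(".")[:2]
    return int(major), int(minor)
```

With a live session named spark, major_minor(spark.version) and major_minor(pyspark.__version__) could then be compared or checked against a minimum version.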
Find out more about SparkSession in the PySpark API documentation.
Once you are finished with the cluster, it's a good idea to shut it down, which will free up its resources, making them available for other processes.
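One way to make sure the cluster is always shut down, even if the work in between raises an error, is to wrap the session in a small context manager. This is a sketch under the assumption that the object passed in has a stop() method, as SparkSession does. The name stopping is hypothetical.

```python
from contextlib import contextmanager


@contextmanager
def stopping(session):
    """Yield the session and call its stop() method on exit, even after an error."""
    try:
        yield session
    finally:
        # Free the cluster's resources for other processes.
        session.stop()
```

Typical use would be "with stopping(spark) as s:" followed by the work that needs the session. Once the block exits, the session has been stopped.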
Notes:
- You might find it useful to review the slides from the lessons in the Slides panel next to the IPython Shell.
- The version of Spark in the exercise is not the same as in the lessons. The exercise platform has been updated to a more recent version of Spark.
This exercise is part of the course
Machine Learning with PySpark
Exercise instructions
- Import the SparkSession class from pyspark.sql.
- Create a SparkSession object connected to a local cluster. Use all available cores. Name the application 'test'.
- Use the version attribute on the SparkSession object to retrieve the version of Spark running on the cluster. Note: The version might be different to the one that's used in the presentation (it gets updated from time to time).
- Shut down the cluster.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Import the SparkSession class
from ____ import ____
# Create SparkSession object
spark = SparkSession.builder \
    .master(____) \
    .____(____) \
    .____()
# What version of Spark?
print(spark.____)
# Terminate the cluster
spark.____()