Creating a SparkSession
In this exercise, you'll spin up a local Spark cluster using all available cores. The cluster will be accessible via a SparkSession object.
The SparkSession class has a builder attribute, which is an instance of the Builder class. The Builder class exposes three important methods that let you:
- specify the location of the master node;
- name the application (optional); and
- retrieve an existing SparkSession or, if there is none, create a new one.
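The three Builder methods listed above are master(), appName(), and getOrCreate(). The following is a minimal sketch of how they chain together, assuming PySpark is installed. The helper names local_master_url and create_local_session are hypothetical and not part of PySpark; the strings local[*] (all cores) and local[N] (N cores) are the standard Spark master URLs for local mode.

```python
def local_master_url(cores=None):
    """Build a Spark master URL for a local cluster (hypothetical helper).

    cores=None means "use all available cores" and yields 'local[*]';
    a positive integer n yields 'local[n]'.
    """
    if cores is None:
        return "local[*]"
    if cores < 1:
        raise ValueError("cores must be a positive integer")
    return f"local[{cores}]"


def create_local_session(app_name="test", cores=None):
    """Return an existing SparkSession or create a new local one."""
    # Imported inside the function so that the URL helper above can be
    # used even where PySpark is not installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .master(local_master_url(cores))  # location of the master node
        .appName(app_name)                # application name (optional)
        .getOrCreate()                    # reuse a session or create one
    )
```

Calling create_local_session() would request a local cluster on all cores with the application name 'test'. Wrapping the chain in parentheses is a style choice that avoids the trailing backslashes used in the exercise template.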
The SparkSession class has a version attribute which gives the version of Spark. Note: The version can also be accessed via the __version__ attribute on the pyspark module.
Find out more about SparkSession in the official PySpark documentation.
Once you are finished with the cluster, it's a good idea to shut it down, which will free up its resources, making them available for other processes.
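One way to make sure the cluster is always shut down is to wrap the work in try/finally. The sketch below is illustrative only: report_spark_version and its create_session parameter are hypothetical names, and the function relies only on the version attribute and the stop() method of the session object it receives.

```python
def report_spark_version(create_session):
    """Create a session, return its Spark version, and always stop it.

    create_session is a hypothetical zero-argument callable that returns
    a SparkSession (or any object with .version and .stop()).
    """
    spark = create_session()
    try:
        # The version attribute gives the version of Spark on the cluster.
        return spark.version
    finally:
        # stop() shuts the cluster down and frees its resources,
        # even if reading the version raised an exception.
        spark.stop()
```

With PySpark installed, create_session could be a function that returns SparkSession.builder.master('local[*]').appName('test').getOrCreate().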
Notes:
- You might find it useful to review the slides from the lessons in the Slides panel next to the IPython Shell.
- The version of Spark in the exercise is not the same as in the lessons. The exercise platform has been updated to a more recent version of Spark.
This exercise is part of the course Machine Learning with PySpark.
Exercise instructions
- Import the SparkSession class from pyspark.sql.
- Create a SparkSession object connected to a local cluster. Use all available cores. Name the application 'test'.
- Use the version attribute on the SparkSession object to retrieve the version of Spark running on the cluster. Note: The version might be different to the one that's used in the presentation (it gets updated from time to time).
- Shut down the cluster.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Import the SparkSession class
from ____ import ____
# Create SparkSession object
spark = SparkSession.builder \
    .master(____) \
    .____(____) \
    .____()
# What version of Spark?
print(spark.____)
# Terminate the cluster
spark.____()