Aan de slagGa gratis aan de slag

Creating a SparkSession

In this exercise, you'll spin up a local Spark cluster using all available cores. The cluster will be accessible via a SparkSession object.

The SparkSession class has a builder attribute, which is an instance of the Builder class. The Builder class exposes three important methods that let you:

  • specify the location of the master node;
  • name the application (optional); and
  • retrieve an existing SparkSession or, if there is none, create a new one.

The SparkSession class has a version attribute which gives the version of Spark. Note: The version can also be accessed via the __version__ attribute on the pyspark module.

Find out more about SparkSession here.

Once you are finished with the cluster, it's a good idea to shut it down, which will free up its resources, making them available for other processes.

Notes:

  1. You might find it useful to review the slides from the lessons in the Slides panel next to the IPython Shell.
  2. The version of Spark in the exercise is not the same as in the lessons. The exercise platform has been updated to a more recent version of Spark.

Deze oefening maakt deel uit van de cursus

Machine Learning with PySpark

Cursus bekijken

Oefeninstructies

  • Import the SparkSession class from pyspark.sql.
  • Create a SparkSession object connected to a local cluster. Use all available cores. Name the application 'test'.
  • Use the version attribute on the SparkSession object to retrieve the version of Spark running on the cluster. Note: The version might be different to the one that's used in the presentation (it gets updated from time to time).
  • Shut down the cluster.

Praktische interactieve oefening

Probeer deze oefening eens door deze voorbeeldcode in te vullen.

# Import the SparkSession class
from ____ import ____

# Create SparkSession object
spark = SparkSession.builder \
                    .master(____) \
                    .____(____) \
                    .____()

# What version of Spark?
print(spark.____)

# Terminate the cluster
spark.____()
Code bewerken en uitvoeren