Using DataFrames
Spark's core data structure is the Resilient Distributed Dataset (RDD). This is a low-level object that lets Spark work its magic by splitting data across multiple nodes in the cluster. However, RDDs are hard to work with directly, so in this course you'll be using the Spark DataFrame abstraction built on top of RDDs.
The Spark DataFrame was designed to behave a lot like a SQL table (a table with variables in the columns and observations in the rows). Not only are DataFrames easier to understand, they are also more optimized for complicated operations than RDDs.
When you start modifying and combining columns and rows of data, there are many ways to arrive at the same result, but some often take much longer than others. When using RDDs, it's up to the data scientist to figure out the right way to optimize the query, but the DataFrame implementation has much of this optimization built in!
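To make this concrete, here is a minimal sketch of a DataFrame aggregation. It assumes a SparkSession named spark is already available (as it is later in this course), and the sample data and column names are purely illustrative; the point is that you describe what you want and Spark's optimizer plans how to execute it, rather than you hand-tuning RDD operations.

```python
# Assumes a running SparkSession named `spark` (illustrative sample data)
df = spark.createDataFrame(
    [("a", 1), ("a", 2), ("b", 3)],
    ["key", "value"],
)

# Describe the result you want; Spark decides how to compute it efficiently
df.groupBy("key").sum("value").show()
```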
To start working with Spark DataFrames, you first have to create a SparkSession object from your SparkContext. You can think of the SparkContext as your connection to the cluster and the SparkSession as your interface with that connection.
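For reference, a minimal sketch of creating a SparkSession with the builder pattern; the variable name spark is chosen here to match the session used in this course:

```python
from pyspark.sql import SparkSession

# Create (or retrieve) a SparkSession. getOrCreate() returns the existing
# session if one is already running instead of starting a new one.
spark = SparkSession.builder.getOrCreate()

# Verify the session by printing the object
print(spark)
```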
Remember, for the rest of this course you'll have a SparkSession called spark available in your workspace!
Which of the following is an advantage of Spark DataFrames over RDDs?
