RDD to DataFrame
Like RDDs, DataFrames are immutable, distributed data structures in Spark. Although RDDs are Spark's fundamental data structure, DataFrames are generally easier to work with, so it is useful to know how to convert an RDD to a DataFrame.
In this exercise, you'll first make an RDD using the sample_list that is already provided to you. This list contains the tuples ('Mona', 20), ('Jennifer', 34), ('John', 20), ('Jim', 26), with each tuple containing the name of a person and their age. Next, you'll create a DataFrame from the RDD and a schema (the list of column names 'Name' and 'Age'), and finally confirm that the output is a PySpark DataFrame.
Remember, you already have a SparkContext sc and SparkSession spark available in your workspace.
This exercise is part of the course Big Data Fundamentals with PySpark.
Exercise instructions
- Create an RDD from the sample_list.
- Create a PySpark DataFrame using the above RDD and schema.
- Confirm that the output is a PySpark DataFrame.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Create an RDD from the list
rdd = sc.____(sample_list)
# Create a PySpark DataFrame
names_df = spark.createDataFrame(____, ____=['Name', 'Age'])
# Check the type of names_df
print("The type of names_df is", ____(names_df))