RDD to DataFrame

Like RDDs, DataFrames are immutable, distributed data structures in Spark. Although RDDs are Spark's fundamental data structure, DataFrames are usually easier to work with because they organize data into named columns. It is therefore important to understand how to convert an RDD to a DataFrame.

In this exercise, you'll first create an RDD from sample_list, which is already provided to you. The list holds four tuples, ('Mona', 20), ('Jennifer', 34), ('John', 20), and ('Jim', 26), each containing a person's name and age. Next, you'll create a DataFrame from the RDD and a schema (the list of column names 'Name' and 'Age'), and finally confirm that the output is a PySpark DataFrame.

Remember, you already have a SparkContext sc and SparkSession spark available in your workspace.

Create an RDD from sample_list.
Create a PySpark DataFrame from the RDD and the schema.
Confirm that the output is a PySpark DataFrame.
