Session Ready
Exercise

RDD to DataFrame

Similar to RDDs, DataFrames are immutable and distributed data structures in Spark. Even though RDDs are a fundamental data structure in Spark, working with data in DataFrame is easier than RDD most of the time and so understanding of how to convert RDD to DataFrame is necessary.

In this exercise, you'll first make an RDD using the sample_list which contains the list of tuples ('Mona',20), ('Jennifer',34),('John',20), ('Jim',26) with each tuple contains the name of the person and their age. Next, you'll create a DataFrame using the RDD and the schema (which is the list of 'Name' and 'Age') and finally confirm the output as PySpark DataFrame.

Remember, you already have a SparkContext sc and SparkSession spark available in your workspace.

Instructions
100 XP
  • Create a sample_list from tuples - ('Mona',20), ('Jennifer',34), ('John',20), ('Jim',26).
  • Create an RDD from the sample_list.
  • Create a PySpark DataFrame using the above RDD and schema.
  • Confirm the output as PySpark DataFrame.