RDD to DataFrame

Like RDDs, DataFrames are immutable, distributed data structures in Spark. Although RDDs are Spark's fundamental data structure, DataFrames are generally easier to work with, so it is useful to know how to convert an RDD to a DataFrame.

In this exercise, you'll first create an RDD from sample_list, which is already provided. The list holds four tuples, ('Mona', 20), ('Jennifer', 34), ('John', 20), and ('Jim', 26), each containing a person's name and age. Next, you'll create a DataFrame from the RDD and a schema (the list of column names 'Name' and 'Age'), and finally confirm that the output is a PySpark DataFrame.

Remember, you already have a SparkContext sc and SparkSession spark available in your workspace.
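
To make the role of the schema concrete, here is a small plain-Python sketch (not Spark code, and not the exercise solution). It only illustrates the idea that each name in the schema list labels the value at the same position in every tuple. The variable names schema and rows are illustrative assumptions, not part of the exercise workspace.

```python
# Plain-Python illustration only; no Spark is involved here.
sample_list = [('Mona', 20), ('Jennifer', 34), ('John', 20), ('Jim', 26)]
schema = ['Name', 'Age']  # hypothetical variable holding the column names

# Pair each column name with the value at the same position in the tuple.
rows = [dict(zip(schema, person)) for person in sample_list]

for row in rows:
    print(row)
```

A DataFrame applies the same positional labelling, but keeps the rows distributed across the cluster rather than in a local list.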

This exercise is part of the course

Big Data Fundamentals with PySpark

Exercise instructions

  • Create an RDD from the sample_list.
  • Create a PySpark DataFrame using the above RDD and schema.
  • Confirm that the output is a PySpark DataFrame.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Create an RDD from the list
rdd = sc.____(sample_list)

# Create a PySpark DataFrame
names_df = spark.createDataFrame(____, ____=['Name', 'Age'])

# Check the type of names_df
print("The type of names_df is", ____(names_df))