RDD to DataFrame
Like RDDs, DataFrames are immutable, distributed data structures in Spark. Although RDDs are Spark's fundamental data structure, DataFrames are usually easier to work with because they organize data into named columns. It is therefore useful to know how to convert an RDD to a DataFrame.
In this exercise, you'll first make an RDD from the sample_list that is already provided to you. The list contains the tuples ('Mona', 20), ('Jennifer', 34), ('John', 20), ('Jim', 26), each holding a person's name and age. Next, you'll create a DataFrame from the RDD and a schema (the list of column names 'Name' and 'Age'), and finally confirm that the output is a PySpark DataFrame.
Remember, you already have a SparkContext sc and SparkSession spark available in your workspace.
This exercise is part of the course
Big Data Fundamentals with PySpark
Exercise instructions
- Create an RDD from the sample_list.
- Create a PySpark DataFrame using the above RDD and schema.
- Confirm that the output is a PySpark DataFrame.
Hands-on interactive exercise
Try this exercise by completing the sample code below.
# Create an RDD from the list
rdd = sc.____(sample_list)
# Create a PySpark DataFrame
names_df = spark.createDataFrame(____, ____=['Name', 'Age'])
# Check the type of names_df
print("The type of names_df is", ____(names_df))