Collecting RDDs
For this exercise, you'll work with both RDDs and DataFrames in PySpark. The goal is to group data and perform an aggregation using both RDD operations and DataFrame methods.
You will load a CSV file containing employee salary data into PySpark and convert it to an RDD. You'll then group the DataFrame by experience level and calculate the maximum salary for each level. By doing this, you'll see the relative strengths of both data formats.
The dataset you're using is related to Data Scientist salaries, so finding market trends is in your best interest! We've already loaded and normalized the data for you. Remember, there's already a SparkSession called spark in your workspace!
This exercise is part of the course
Introduction to PySpark
Exercise instructions
- Create an RDD from a DataFrame.
- Collect and display the results of the RDD and DataFrame.
- Group by the "experience_level" column and calculate the maximum salary for each level.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Create an RDD from the df_salaries
rdd_salaries = df_salaries.____
# Collect and print the results
print(rdd_salaries.____)
# Group by the experience level and calculate the maximum salary
dataframe_results = df_salaries.____("experience_level").____({"salary_in_usd": 'max'})
# Show the results
dataframe_results.____