
Collecting RDDs

For this exercise, you’ll work with both RDDs and DataFrames in PySpark. The goal is to group data and perform aggregation using both RDD operations and DataFrame methods.

You will take a CSV file containing employee salary data that has been loaded into PySpark as a DataFrame and convert it to an RDD. You'll then group the data by experience level and calculate the maximum salary for each level using DataFrame methods. By doing this, you'll see the relative strengths of both data formats.

The dataset relates to Data Scientist salaries, so finding market trends is in your best interest! We've already loaded and normalized the data for you. Remember, there's already a SparkSession called spark in your workspace!
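For reference, loading a CSV like this typically looks as follows. This is only a sketch: the file name salaries.csv and the read options are assumptions, since the data is already loaded for you in this exercise.

# Hypothetical example: how df_salaries could be loaded from a CSV file.
# The file name "salaries.csv" is an assumption; in this exercise the
# DataFrame is already available in your workspace.
df_salaries = spark.read.csv("salaries.csv", header=True, inferSchema=True)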

This exercise is part of the course

Introduction to PySpark


Exercise instructions

  • Create an RDD from a DataFrame.
  • Collect and display the results of the RDD and DataFrame.
  • Group by the "experience_level" and calculate the maximum salary for each.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Create an RDD from the df_salaries
rdd_salaries = df_salaries.____

# Collect and print the results
print(rdd_salaries.____)

# Group by the experience level and calculate the maximum salary
dataframe_results = df_salaries.____("experience_level").____({"salary_in_usd": 'max'})

# Show the results
dataframe_results.____
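One possible way to complete the blanks is sketched below. It assumes df_salaries is a DataFrame with experience_level and salary_in_usd columns, as described above; try filling in the blanks yourself before comparing.

# Create an RDD from the df_salaries DataFrame using its .rdd attribute
rdd_salaries = df_salaries.rdd

# Collect the RDD into a Python list and print it
print(rdd_salaries.collect())

# Group by the experience level and calculate the maximum salary per level
dataframe_results = df_salaries.groupBy("experience_level").agg({"salary_in_usd": "max"})

# Show the aggregated results
dataframe_results.show()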