
Collecting RDDs

For this exercise, you’ll work with both RDDs and DataFrames in PySpark. The goal is to group data and perform aggregation using both RDD operations and DataFrame methods.

You will load a CSV file containing employee salary data into PySpark as an RDD. You'll then group the data by experience level and calculate the maximum salary for each level using a DataFrame. Along the way, you'll see the relative strengths of both data formats.

The dataset covers Data Scientist salaries, so finding market trends is in your best interest! We've already loaded and normalized the data for you. Remember, there's already a SparkSession called spark in your workspace!

This exercise is part of the course

Introduction to PySpark


Instructions

  • Create an RDD from a DataFrame.
  • Collect and display the results of the RDD and DataFrame.
  • Group by the "experience_level" and calculate the maximum salary for each.

Hands-on interactive exercise

Try this exercise by completing the sample code.

# Create an RDD from the df_salaries
rdd_salaries = df_salaries.____

# Collect and print the results
print(rdd_salaries.____)

# Group by the experience level and calculate the maximum salary
dataframe_results = df_salaries.____("experience_level").____({"salary_in_usd": 'max'})

# Show the results
dataframe_results.____