
Collecting RDDs

For this exercise, you’ll work with both RDDs and DataFrames in PySpark. The goal is to group data and perform aggregation using both RDD operations and DataFrame methods.

You will take a DataFrame of employee salary data, loaded from a CSV file into PySpark, and convert it into an RDD. You'll then group the DataFrame by experience level and calculate the maximum salary for each level. By doing this, you'll see the relative strengths of both data formats.

The dataset you're using is related to Data Scientist Salaries, so spotting market trends is in your best interest! We've already loaded and normalized the data for you. Remember, there's already a SparkSession called spark in your workspace!
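To get a feel for that contrast, here is a minimal sketch (not part of the exercise scaffold below) of how the same maximum-salary-per-level aggregation could be written purely with RDD operations. It assumes df_salaries has the experience_level and salary_in_usd columns used later in this exercise.

# Sketch only: an RDD-style equivalent of the DataFrame aggregation,
# mapping each Row to a (experience_level, salary_in_usd) pair and
# reducing by key with Python's built-in max
rdd_max = (
    df_salaries.rdd
    .map(lambda row: (row["experience_level"], row["salary_in_usd"]))
    .reduceByKey(max)
)
print(rdd_max.collect())

The DataFrame version you'll complete below expresses the same idea in a single groupBy/agg call, which is part of what this exercise is meant to highlight.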

This exercise is part of the course

Introduction to PySpark


Exercise instructions

  • Create an RDD from a DataFrame.
  • Collect and display the results of the RDD and DataFrame.
  • Group by the "experience_level" and calculate the maximum salary for each.

Hands-on interactive exercise

Try this exercise by completing the sample code below.

# Create an RDD from the df_salaries
rdd_salaries = df_salaries.____

# Collect and print the results
print(rdd_salaries.____)

# Group by the experience level and calculate the maximum salary
dataframe_results = df_salaries.____("experience_level").____({"salary_in_usd": 'max'})

# Show the results
dataframe_results.____
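One possible completed version of the scaffold above, assuming df_salaries has the columns experience_level and salary_in_usd referenced in the exercise:

# Create an RDD from the df_salaries DataFrame
rdd_salaries = df_salaries.rdd

# Collect the RDD to the driver and print the resulting list of Rows
print(rdd_salaries.collect())

# Group by the experience level and calculate the maximum salary per group
dataframe_results = df_salaries.groupBy("experience_level").agg({"salary_in_usd": "max"})

# Show the aggregated results
dataframe_results.show()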