Reading a CSV and performing aggregations

You have a spreadsheet of Data Scientist salaries from companies ranging in size from small to large. You want to see whether average salaries differ substantially when grouped by company size.

Remember, there's already a SparkSession called spark in your workspace!

This exercise is part of the course

Introduction to PySpark

Instructions

  • Load a CSV file as a DataFrame and infer the schema.
  • Return the count of the number of rows.
  • Group by the column company_size and calculate the average salary with salary_in_usd.
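
To make the target of the last two steps concrete without giving away the exercise, here is a minimal plain-Python sketch (not Spark) of what counting rows and averaging salary_in_usd per company_size computes. The sample rows and the helper name average_salary_by_size are hypothetical; the real contents of salaries.csv are not shown in this exercise.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample data; the real salaries.csv is not shown in the exercise.
SAMPLE_CSV = """company_size,salary_in_usd
S,60000
M,90000
L,120000
S,80000
M,110000
"""


def average_salary_by_size(csv_text):
    """Return (row_count, {company_size: average salary_in_usd})."""
    totals = defaultdict(float)  # running salary sum per company size
    counts = defaultdict(int)    # number of rows per company size
    row_count = 0
    for row in csv.DictReader(io.StringIO(csv_text)):
        row_count += 1
        size = row["company_size"]
        # The csv module yields strings, so convert to a number by hand.
        totals[size] += float(row["salary_in_usd"])
        counts[size] += 1
    averages = {size: totals[size] / counts[size] for size in totals}
    return row_count, averages


row_count, averages = average_salary_by_size(SAMPLE_CSV)
print(f"Total rows: {row_count}")
for size, avg in sorted(averages.items()):
    print(size, avg)
```

The manual float conversion is the part that schema inference handles in Spark: without it, columns read from a CSV are treated as strings, and a numeric average is not meaningful.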

Hands-on interactive exercise

Try this exercise by completing this sample code.

# Load the CSV file into a DataFrame
salaries_df = ____("salaries.csv", header=True, inferSchema=____)

# Count the total number of rows
row_count = salaries_df.____
print(f"Total rows: {row_count}")

# Group by company size and calculate the average of salaries
salaries_df.____("company_size").____({"salary_in_usd": "avg"}).show()
salaries_df.show()