Reading a CSV and performing aggregations
You have a spreadsheet of data scientist salaries from companies ranging in size from small to large. You want to see whether average salaries differ substantially by company size.
Remember, there's already a SparkSession called spark in your workspace!
This exercise is part of the course Introduction to PySpark.
Exercise instructions
- Load a CSV file as a DataFrame and infer the schema.
- Return the number of rows in the DataFrame.
- Group by the column company_size and calculate the average salary with salary_in_usd.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Load the CSV file into a DataFrame
salaries_df = ____("salaries.csv", header=True, inferSchema=____)
# Count the total number of rows
row_count = salaries_df.____
print(f"Total rows: {row_count}")
# Group by company size and calculate the average of salaries
salaries_df.____("company_size").____({"salary_in_usd": "avg"}).show()
salaries_df.show()