Grouping and Aggregating I
Part of what makes aggregating so powerful is the addition of groups. PySpark has a whole class devoted to grouped DataFrames: pyspark.sql.GroupedData, which you saw in the last two exercises.
You've learned how to create a grouped DataFrame by calling the .groupBy() method on a DataFrame with no arguments.
Now you'll see that when you pass the name of one or more columns in your DataFrame to the .groupBy() method, the aggregation methods behave just as they would in a SQL query with a GROUP BY statement!
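For example, here's a minimal sketch of that equivalence. The toy DataFrame and its values below are purely illustrative (they are not the exercise data, where flights is already loaded for you):

# Illustrative sketch: grouping by a column and then aggregating
# mirrors SQL's "SELECT origin, AVG(air_time) FROM ... GROUP BY origin"
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
toy = spark.createDataFrame(
    [("PDX", 120.0), ("PDX", 90.0), ("SEA", 150.0)],
    ["origin", "air_time"],
)
toy.groupBy("origin").avg("air_time").show()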
Remember, a SparkSession called spark is already in your workspace, along with the Spark DataFrame flights.
Exercise instructions
- Create a DataFrame called by_plane that is grouped by the column tailnum.
- Use the .count() method with no arguments to count the number of flights each plane made.
- Create a DataFrame called by_origin that is grouped by the column origin.
- Find the .avg() of the air_time column to find the average duration of flights from PDX and SEA.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Group by tailnum
by_plane = flights.groupBy("____")
# Number of flights each plane made
by_plane.____.show()
# Group by origin
by_origin = flights.groupBy("____")
# Average duration of flights from PDX and SEA
by_origin.avg("____").show()
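For reference, one possible completed version is sketched below, assuming flights has the tailnum, origin, and air_time columns described in the instructions:

# Group by tailnum
by_plane = flights.groupBy("tailnum")

# Number of flights each plane made
by_plane.count().show()

# Group by origin
by_origin = flights.groupBy("origin")

# Average duration of flights from PDX and SEA
by_origin.avg("air_time").show()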